# Keypoint Estimation with FiftyOne

## Who this is for
This tutorial is designed for:
- Computer vision engineers with basic FiftyOne experience ([familiar with Datasets](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/) and [the FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/))
- Practitioners with an intermediate understanding of computer vision and keypoint detection
- Those looking to implement or evaluate keypoint detection models in their workflow

## Assumed Knowledge
### Computer Vision Concepts
- Basic understanding of keypoint detection and pose estimation
- Familiarity with common CV models (YOLO, R-CNN)
- Understanding of confidence scores and model evaluation metrics

### Technical Requirements
- Python programming (intermediate level)
- Basic understanding of PyTorch or similar deep learning frameworks

### FiftyOne Concepts
You should be familiar with:
- [Datasets and Samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/)
- [The FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/)
- [Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/models/)
- [Plugins](https://beta-docs.voxel51.com/plugins/)

## Time to Complete
Estimated time: 30 minutes

## Required Packages
First, ensure you have a virtual environment with FiftyOne installed. Then install the following packages:

```bash
# Install required packages
pip install fiftyone
pip install ultralytics
pip install torch torchvision
pip install transformers
pip install Pillow
pip install opencv-python
```

## Content Overview
This tutorial covers:
1. **Dataset Download**: Loading a hand keypoints dataset from Hugging Face
2. **Model Zoo Integration**: Using FiftyOne's built-in Keypoint R-CNN model
3. **Ultralytics Integration**: Implementing YOLO pose estimation
4. **Plugin Usage**: Leveraging the community-contributed ViTPose plugin
5. **Custom Integration**: Implementing an arbitrary keypoint detection model (SuperPoint)
6. **Evaluation**: Assessing keypoint detection performance

# Download Dataset

Let's start by downloading a dataset from Voxel51's [Hugging Face org](https://huggingface.co/Voxel51). In this tutorial, we'll use the [Hands Keypoint](https://huggingface.co/datasets/Voxel51/hand-keypoints) dataset:

In [1]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

keypoint_dataset = load_from_hub("voxel51/hand-keypoints", overwrite=True)

Downloading config file fiftyone.yml from voxel51/hand-keypoints
Loading dataset
Importing samples...
 100% |█████████████████| 846/846 [29.1ms elapsed, 0s remaining, 29.1K samples/s]      


# Using the Model Zoo

The FiftyOne [Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/models/) has several models you can use for common tasks, including one for pose estimation, [**Keypoint R-CNN ResNet50 FPN COCO Torch**](https://beta-docs.voxel51.com/models/model_zoo/models/#keypoint-rcnn-resnet50-fpn-coco-torch).

This model is a variant of the R-CNN architecture designed for keypoint detection, using a ResNet50 backbone with a Feature Pyramid Network (FPN). It is pre-trained on the COCO dataset to detect human body keypoints and achieves a keypoint AP of around 61.1% on COCO-val2017.

We'll [load the model from the FiftyOne Model Zoo via `load_zoo_model`](https://beta-docs.voxel51.com/api/fiftyone.zoo.models.html#load_zoo_model), and then run inference on our dataset [using the `apply_model` method](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#apply_model) of the [Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html). Because this model returns both bounding boxes and keypoints, passing `label_field="rcnn_keypoints"` stores the results in two [Fields](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/) that use it as a prefix: `rcnn_keypoints_detections` and `rcnn_keypoints_keypoints`:

In [2]:
import fiftyone as fo
import fiftyone.zoo as foz

rcnn_pose_model = foz.load_zoo_model("keypoint-rcnn-resnet50-fpn-coco-torch")

keypoint_dataset.apply_model(rcnn_pose_model, label_field="rcnn_keypoints")

 100% |█████████████████| 846/846 [21.1s elapsed, 0s remaining, 38.1 samples/s]      


We can inspect the contents of the [first](https://beta-docs.voxel51.com/tutorials/pandas_comparison/#first-and-last) [Sample](https://beta-docs.voxel51.com/api/fiftyone.core.sample.Sample.html):

In [3]:
keypoint_dataset.first()['rcnn_keypoints_keypoints']

<Keypoints: {
    'keypoints': [
        <Keypoint: {
            'id': '67d9a06b1b6600364313edd9',
            'attributes': {},
            'tags': [],
            'label': 'person',
            'points': [
                [0.4506858825683594, 0.16730889214409722],
                [0.4701820373535156, 0.1366334561948423],
                [0.43118969599405926, 0.13796715912995516],
                [0.4919277826944987, 0.16730889214409722],
                [0.40719436009724935, 0.165975175080476],
                [0.5429178237915039, 0.3540289984809028],
                [0.36445271174112953, 0.3513615360966435],
                [0.5556653340657552, 0.6127696849681713],
                [0.4101937929789225, 0.5860954002097801],
                [0.5114240010579427, 0.7581446329752605],
                [0.4236911137898763, 0.3433592619719329],
                [0.5181727091471354, 0.7514759770146122],
                [0.41169347763061526, 0.7568109017831308],
                [0.657645416259
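
Each `points` list above follows the model's keypoint ordering. COCO person annotations conventionally use 17 keypoints in a fixed order. As a minimal sketch, assuming that standard COCO ordering applies to this model's output, you can pair each coordinate with a joint name. The `COCO_KEYPOINT_NAMES` list and the `name_points` helper are illustrative and not part of FiftyOne; the sample values are the first three points printed above.

```python
# Standard COCO person keypoint order (17 joints).
# Assumption: the model's output follows this order.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def name_points(points, names=COCO_KEYPOINT_NAMES):
    """Pair each (x, y) point with its joint name, in order."""
    # zip stops at the shorter input, so a partial list of points is fine
    return {name: tuple(point) for name, point in zip(names, points)}


# First three points from the output above (normalized [x, y] coordinates)
first_points = [
    [0.4506858825683594, 0.16730889214409722],
    [0.4701820373535156, 0.1366334561948423],
    [0.43118969599405926, 0.13796715912995516],
]

named = name_points(first_points)
for joint, (x, y) in named.items():
    print(f"{joint:>10}: x={x:.3f}, y={y:.3f}")
```

On a real sample you would pass `keypoint.points` from one of the `Keypoint` objects instead of the hard-coded list.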

# Using Ultralytics

FiftyOne has an [integration with Ultralytics](https://beta-docs.voxel51.com/integrations/ultralytics/) which makes it easy for you to use one of their [Keypoint estimation](https://beta-docs.voxel51.com/integrations/ultralytics/#keypoints) models. All you have to do is instantiate an Ultralytics model for keypoint estimation and pass that into the [`apply_model` method](https://beta-docs.voxel51.com/api/fiftyone.core.models.html#apply_model) of your [Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html). 

In [None]:
!pip install ultralytics

In [4]:
from ultralytics import YOLO

ul_pose_model = YOLO("yolo11n-pose.pt") 

keypoint_dataset.apply_model(ul_pose_model, label_field="ul_pose")

 100% |█████████████████| 846/846 [23.5s elapsed, 0s remaining, 36.3 samples/s]      


You can inspect the [first Sample](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#first) of your Dataset as before:

In [5]:
keypoint_dataset.first()['ul_pose']

<Keypoints: {
    'keypoints': [
        <Keypoint: {
            'id': '67d9a07f1b660036431403ca',
            'attributes': {},
            'tags': [],
            'label': 'person',
            'points': [
                [0.44621869921684265, 0.16390232741832733],
                [0.4681428074836731, 0.13291257619857788],
                [0.43007001280784607, 0.13300056755542755],
                [0.4991542100906372, 0.16258731484413147],
                [0.4017489552497864, 0.16290490329265594],
                [0.5392971038818359, 0.33804500102996826],
                [0.3559896945953369, 0.3735353946685791],
                [0.567536473274231, 0.6175299882888794],
                [0.3908997178077698, 0.548811674118042],
                [0.513115406036377, 0.7681204676628113],
                [0.4212362468242645, 0.30758777260780334],
                [0.5325959324836731, 0.7181503176689148],
                [0.4006675183773041, 0.7379183173179626],
                [0.554581463336

# Using Plugins

FiftyOne provides a powerful [Plugin framework](https://beta-docs.voxel51.com/plugins/) that allows for extending and customizing the functionality of the tool to suit your specific needs. Check out the [FiftyOne plugins repository](https://github.com/voxel51/fiftyone-plugins) for a growing collection of plugins that you can easily [download](https://beta-docs.voxel51.com/plugins/using_plugins/#plugins-download) and use locally.

One plugin that's been contributed by a community member is the [ViTPose plugin](https://github.com/harpreetsahota204/vitpose-plugin). You can learn more about the plugin by visiting its GitHub repo.

Let's start by setting a required environment variable:

In [6]:
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

Now, [download the plugin](https://beta-docs.voxel51.com/plugins/using_plugins/#downloading-plugins):

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/vitpose-plugin

Ensure that you've [installed all the requirements](https://beta-docs.voxel51.com/plugins/using_plugins/#installing-plugin-requirements) for the plugin:

In [None]:
!fiftyone plugins requirements @harpreetsahota/vitpose --install

This plugin requires that we have some [Metadata about our dataset](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/#metadata). To ensure you have the required metadata, use the [`compute_metadata` method](https://beta-docs.voxel51.com/api/fiftyone.core.metadata.html#compute_metadata) of the Dataset:

In [7]:
keypoint_dataset.compute_metadata()

Computing metadata...
 100% |█████████████████| 846/846 [242.9ms elapsed, 0s remaining, 3.5K samples/s]      


You can use the Plugin via the [FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/) or [via the SDK](https://beta-docs.voxel51.com/plugins/using_plugins/#calling-operators) by instantiating the operator [using `get_operator`](https://beta-docs.voxel51.com/api/fiftyone.operators.registry.html#get_operator) as shown below:

In [None]:
import fiftyone.operators as foo

vitpose_operator = foo.get_operator("@harpreetsahota/vitpose/vitpose_keypoint_estimator")

This model requires that we have bounding boxes. Luckily, we already obtained these when we applied `keypoint-rcnn-resnet50-fpn-coco-torch` to our Dataset above; they are stored in the `rcnn_keypoints_detections` field.

You'll need to start a [Delegated Service](https://beta-docs.voxel51.com/plugins/developing_plugins/#delegated-execution_1). To do so, open your terminal and run: `fiftyone delegated launch`

In [None]:
# Run the operator on your dataset
await vitpose_operator(
    keypoint_dataset,
    model_name="usyd-community/vitpose-plus-small",  # Select one of the supported models
    bbox_field="rcnn_keypoints_detections",  # Field where your bounding box detections are stored
    output_field="vitpose_estimates",  # Field to store the Keypoints in
    confidence_threshold=0.55,  # Confidence threshold for keypoint detection
    delegate=True,  # Queue the operation for the delegated service
)

If you're running the Operator in a notebook, you will need to [Save your Dataset](https://beta-docs.voxel51.com/faq/#why-didnt-changes-to-my-dataset-save):

In [10]:
keypoint_dataset.save()

You can monitor the progress of the execution in your terminal. You'll see something to the effect of `Operation 67d88b7a7fe8205cb39fcf8b complete`  upon successful execution of the [Operator](https://beta-docs.voxel51.com/api/fiftyone.operators.operator.Operator.html). Like before, you can inspect the first element of your Dataset:

In [11]:
keypoint_dataset.first()['vitpose_estimates']

<Keypoints: {
    'keypoints': [
        <Keypoint: {
            'id': '67d9a0c46b8b76864b1b1714',
            'attributes': {},
            'tags': [],
            'label': 'person',
            'points': [
                [0.44923893610636395, 0.1704833984375],
                [0.4699263572692871, 0.13364246509693287],
                [0.4285169919331869, 0.13664460358796296],
                [0.49417511622111004, 0.1592803955078125],
                [0.40818579991658527, 0.1666087962962963],
                [0.538152567545573, 0.35329295970775465],
                [0.37299680709838867, 0.3453253286856192],
                [0.5570560455322265, 0.6124801070601852],
                [0.4096650759379069, 0.5843711570457176],
                [0.5075816154479981, 0.763127757884838],
                [0.4215219179789225, 0.34310212311921295],
                [0.5223502159118653, 0.7390630651403356],
                [0.4158531506856283, 0.7545021339699074],
                [0.692319615681966

# Arbitrary Keypoint Estimation Model

The following example shows how to run inference on your dataset using an arbitrary keypoint estimation model.

When integrating a Keypoint detection model with FiftyOne, you'll typically follow a two-phase process: first setting up the model, then processing its outputs into FiftyOne format.

### Setting Up Your Model

Before running inference, you need to configure your model and any preprocessing components. This setup phase typically involves:

1. **Loading Model Weights**: Import your pretrained model from a local path or model hub
2. **Configuring the Model**: Set any specific parameters like confidence thresholds or detection modes
3. **Preparing Preprocessing**: Set up any image transformations required before inference
4. **Device Placement**: Move the model to the appropriate device (CPU/GPU)

This preparation ensures your model is ready to process images efficiently. For a keypoint detection model, the setup might include configuring the number of keypoints to detect, how confidence scores are calculated, and any specific detection thresholds.

Once your model is configured, you can proceed to the inference phase where you'll run the model on each sample and process its outputs into FiftyOne's structure.

In [None]:
import torch
from PIL import Image
import cv2

from transformers import AutoImageProcessor, SuperPointForKeypointDetection

device = "cuda" if torch.cuda.is_available() else "cpu"

superpoint_processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")

superpoint_model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint", device_map=device)

When working with keypoint detection models like SuperPoint, processing their outputs for storage in FiftyOne follows a consistent pattern:

### 1. Process Each Dataset Sample

The workflow starts by [iterating through](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#iter_samples) each [sample in the Dataset](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/). For each image, the model is run to detect keypoints, using the sample's actual dimensions to ensure proper scaling.

### 2. Extract Raw Keypoint Data

The model output typically provides two key pieces of information:
- Keypoint coordinates in pixel space (x, y positions)
- Confidence scores for each detected point

### 3. Normalize Coordinates

FiftyOne requires [Keypoint](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) coordinates to be normalized to [0,1] range rather than pixel coordinates. This normalization makes the keypoints resolution-independent and ensures they work correctly across different image sizes.
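
The normalization itself is a division by the image width and height. The sketch below uses plain Python with hypothetical helper names (`normalize_keypoints`, `denormalize_keypoints`) and a made-up 640x480 image; it is not FiftyOne API, only an illustration of the arithmetic.

```python
def normalize_keypoints(pixel_points, img_width, img_height):
    """Convert (x, y) pixel coordinates to [0, 1] relative coordinates."""
    if img_width <= 0 or img_height <= 0:
        raise ValueError("Image dimensions must be positive")
    return [(x / img_width, y / img_height) for x, y in pixel_points]


def denormalize_keypoints(rel_points, img_width, img_height):
    """Inverse mapping, e.g. for drawing points with a pixel-based library."""
    return [(x * img_width, y * img_height) for x, y in rel_points]


# Hypothetical 640x480 image with two detected points in pixel space
pixel_points = [(320.0, 240.0), (160.0, 120.0)]
rel_points = normalize_keypoints(pixel_points, img_width=640, img_height=480)
print(rel_points)  # [(0.5, 0.5), (0.25, 0.25)]
```

Note that x is divided by the width and y by the height; swapping the two is a common source of misplaced keypoints on non-square images.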

### 4. Create Structured Keypoint Objects

FiftyOne's [`Keypoint`](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) structure has an important hierarchical design that affects how we process model outputs:

- A [`Keypoints` object](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoints.html) is a collection container that holds multiple `Keypoint` objects
- Each [`Keypoint` object](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) can represent either a single point or a semantically meaningful group of points

### 5. Choose the Appropriate Representation

The way we organize points depends on their semantic relationships:

**For semantically unrelated points (as in SuperPoint)**:
- Each interest point becomes its own individual [`Keypoint`](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) object
- All these individual [`Keypoint`](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html)  objects are gathered into a single [`Keypoints` object](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoints.html) collection

**For semantically related points (like hand or body poses)**:
- All related joints would be stored as multiple coordinates within a single `Keypoint` object
- The `Keypoint` object would represent the entire structure (e.g., "left hand" or "body")
- We might have multiple such `Keypoint` objects in one `Keypoints` collection (e.g., for multiple people)

### 6. Store in the Dataset

The final step adds this structured data to each sample in the dataset, creating a new field that holds the keypoints collection.

### Key Representation Differences

The distinction between these two approaches is crucial:

1. **Interest Point Models**: Each detected point stands alone functionally, without inherent relationships to other points. We create separate `Keypoint` objects for each independent point.

2. **Structural Keypoint Models (Pose/Hand)**: The points collectively represent a unified structure where relationships between points matter. We'd store all joints of a hand/body as a single `Keypoint` object with multiple coordinate points.

This flexible hierarchical structure allows FiftyOne to appropriately represent both scattered interest points from models like SuperPoint and structured anatomical features from pose estimation models, while maintaining the proper semantic relationships in each case.
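
To make the two layouts concrete, here is a small sketch that uses plain dictionaries as stand-ins: each dictionary plays the role of a `fo.Keypoint(label=..., points=...)` and the enclosing list plays the role of a `fo.Keypoints(keypoints=[...])` collection. The helper names `as_interest_points` and `as_structure` are hypothetical and exist only for this illustration.

```python
def as_interest_points(points):
    """One single-point entry per detection (SuperPoint-style)."""
    return [
        {"label": f"point_{i}", "points": [tuple(p)]}
        for i, p in enumerate(points)
    ]


def as_structure(label, points):
    """One entry holding every joint of a single instance (pose-style)."""
    return [{"label": label, "points": [tuple(p) for p in points]}]


# The same three normalized coordinates, organized two different ways
points = [(0.45, 0.17), (0.47, 0.14), (0.43, 0.14)]

scattered = as_interest_points(points)
pose = as_structure("person", points)

print(len(scattered), [len(k["points"]) for k in scattered])  # 3 [1, 1, 1]
print(len(pose), [len(k["points"]) for k in pose])  # 1 [3]
```

The interest-point layout yields as many entries as there are points, while the structural layout yields one entry per instance regardless of how many joints it has.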

In [20]:
# Iterate through each sample in the dataset
for sample in keypoint_dataset.iter_samples(autosave=True):
    # Get image dimensions from metadata
    img_height = sample.metadata.height
    img_width = sample.metadata.width
    
    # Load image from file path
    file_path = sample.filepath
    sample_media = Image.open(file_path)
    
    # Process image through SuperPoint model
    inputs = superpoint_processor(sample_media, return_tensors="pt").to(device, superpoint_model.dtype)  # Prepare inputs for model
    with torch.no_grad():  # Inference only, so skip gradient tracking
        outputs = superpoint_model(**inputs)  # Run inference
    
    # Post-process model outputs to get keypoints in image coordinates
    processed_outputs = superpoint_processor.post_process_keypoint_detection(
        outputs, 
        [(img_height, img_width)]  # Provide original image dimensions for coordinate scaling
    )
    
    # Extract keypoint coordinates and confidence scores
    keypoints = processed_outputs[0]['keypoints'].tolist()  # List of [x,y] coordinates
    scores = processed_outputs[0]['scores'].tolist()  # Confidence score for each keypoint
    
    # Create a list of individual Keypoint objects
    keypoint_objects = []
    for idx, (keypoint, score) in enumerate(zip(keypoints, scores)):
        # Normalize coordinates to [0, 1] range for FiftyOne format
        normalized_x = keypoint[0] / img_width
        normalized_y = keypoint[1] / img_height
        
        # Create individual keypoint object with normalized coordinates
        kp = fo.Keypoint(
            label=f"point_{idx}",  # Unique label for each keypoint
            points=[(normalized_x, normalized_y)],  # Points expects a list of coordinates
            confidence=[score]  # Confidence score from model, expected as a list
        )
        keypoint_objects.append(kp)
    
    # Create Keypoints collection containing all keypoints for this image
    keypoints_collection = fo.Keypoints(keypoints=keypoint_objects)
    
    # Add keypoints to the sample; autosave=True persists the change automatically
    sample["superpoint_keypoints"] = keypoints_collection

# Reload dataset to ensure changes are reflected in memory
keypoint_dataset.reload()

You can run the line below (excluded here, because there are over 700 Keypoint objects) and notice the difference between how these are parsed and how the outputs for pose estimation models are parsed:

In [None]:
keypoint_dataset.first()['superpoint_keypoints']
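
If several hundred interest points per image is more than you need, one option is to keep only the highest-scoring ones before building `Keypoint` objects. The following is a minimal sketch in plain Python; the `top_k_by_score` helper, the default `k`, and the sample values are all hypothetical, and it assumes `points` and `scores` are parallel lists like the `keypoints` and `scores` lists extracted in the loop above.

```python
def top_k_by_score(points, scores, k=100, min_score=0.0):
    """Keep at most k points whose score is at least min_score, best first."""
    ranked = sorted(zip(points, scores), key=lambda pair: pair[1], reverse=True)
    kept = [(p, s) for p, s in ranked if s >= min_score][:k]
    return [p for p, _ in kept], [s for _, s in kept]


# Made-up pixel coordinates and confidence scores
points = [(10.0, 20.0), (30.0, 40.0), (50.0, 60.0), (70.0, 80.0)]
scores = [0.20, 0.90, 0.05, 0.60]

kept_points, kept_scores = top_k_by_score(points, scores, k=2, min_score=0.1)
print(kept_points)  # [(30.0, 40.0), (70.0, 80.0)]
print(kept_scores)  # [0.9, 0.6]
```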

Let's inspect the outputs from the various keypoint estimation models. Start by launching the app:

```python
fo.launch_app(keypoint_dataset)
```

<img src ="assets/keypoint-estimation-output.webp" width="70%">


# Evaluate Keypoints

You can use FiftyOne's [Evaluation API](https://beta-docs.voxel51.com/fiftyone_concepts/evaluation/) to evaluate the output of keypoint estimation models.

[Keypoints](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) are treated as [Detections](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detections.html) for the purposes of evaluation in FiftyOne. We can use the [`evaluate_detections`](https://beta-docs.voxel51.com/api/fiftyone.utils.eval.detection.html#evaluate_detections) method of the Dataset to perform evaluation. Note that when evaluating keypoints, “IoUs” are computed via [object keypoint similarity](https://cocodataset.org/#keypoints-eval).
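
For intuition, object keypoint similarity as defined for COCO averages, over the labeled keypoints, the value `exp(-d² / (2 · s² · κ²))`, where `d` is the distance between a predicted and a ground truth keypoint, `s²` is the object area, and `κ` is a per-keypoint constant. The sketch below implements that definition in plain Python. It is an illustration under stated assumptions, not FiftyOne's implementation, which may differ in details such as how the object scale is derived; the function name and the example values for `kappas` and `area` are hypothetical.

```python
import math


def object_keypoint_similarity(gt_points, pred_points, kappas, area, visible=None):
    """COCO-style OKS between one ground truth and one predicted instance.

    gt_points, pred_points: lists of (x, y) in the same coordinate system
    kappas: per-keypoint constants (COCO uses kappa_i = 2 * sigma_i)
    area: object area (scale squared), in squared coordinate units
    visible: optional list of booleans; only visible ground truth points count
    """
    if visible is None:
        visible = [True] * len(gt_points)

    total, count = 0.0, 0
    for (gx, gy), (px, py), kappa, vis in zip(gt_points, pred_points, kappas, visible):
        if not vis:
            continue
        d2 = (gx - px) ** 2 + (gy - py) ** 2  # squared distance
        total += math.exp(-d2 / (2.0 * area * kappa ** 2))
        count += 1

    # No labeled keypoints means there is nothing to compare
    return total / count if count else 0.0


# Hypothetical two-keypoint instance
gt = [(0.5, 0.5), (0.6, 0.6)]
kappas = [0.1, 0.1]
area = 0.04

perfect = object_keypoint_similarity(gt, gt, kappas, area)
shifted = object_keypoint_similarity(gt, [(0.52, 0.5), (0.6, 0.63)], kappas, area)
print(perfect)  # 1.0
print(round(shifted, 3))
```

A perfect match scores 1.0, and the score decays toward 0 as predictions drift away from the ground truth, which is why it can be thresholded in the same way as a bounding-box IoU.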

For illustrative purposes we will consider `ul_pose` as the ground truth and `vitpose_estimates` as the predictions.


In [None]:
kp_eval_results = keypoint_dataset.evaluate_detections(
    pred_field="vitpose_estimates",
    gt_field="ul_pose",
    eval_key="ul_vs_vitpose",
    )

The output of [`evaluate_detections`](https://beta-docs.voxel51.com/api/fiftyone.utils.eval.detection.html#evaluate_detections) is an [EvaluationResults](https://beta-docs.voxel51.com/api/fiftyone.core.evaluation.EvaluationResults.html) object. You can use the [`print_report`](https://beta-docs.voxel51.com/api/fiftyone.utils.eval.base.BaseClassificationResults.html#print_report) or [`print_metrics`](https://beta-docs.voxel51.com/api/fiftyone.utils.eval.base.BaseEvaluationResults.html#print_metrics) methods of the [EvaluationResults](https://beta-docs.voxel51.com/api/fiftyone.core.evaluation.EvaluationResults.html) object to see high-level performance:

In [35]:
kp_eval_results.print_report()

              precision    recall  f1-score   support

      person       0.51      0.97      0.67      1477

   micro avg       0.51      0.97      0.67      1477
   macro avg       0.51      0.97      0.67      1477
weighted avg       0.51      0.97      0.67      1477



In [36]:
kp_eval_results.print_metrics()

accuracy   0.5
precision  0.51
recall     0.97
fscore     0.67
support    1477


The evaluation routine also populated some new fields on our dataset that contain helpful information that we can use to evaluate our predictions at the sample-level.

In particular, each sample now contains new fields:

- `ul_vs_vitpose_tp`: the number of true positive (TP) predictions in the sample
- `ul_vs_vitpose_fp`: the number of false positive (FP) predictions in the sample
- `ul_vs_vitpose_fn`: the number of false negative (FN) predictions in the sample

In [41]:
keypoint_dataset

Name:        voxel51/hand-keypoints
Media type:  image
Num samples: 846
Persistent:  False
Tags:        []
Sample fields:
    id:                        fiftyone.core.fields.ObjectIdField
    filepath:                  fiftyone.core.fields.StringField
    tags:                      fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:                  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:                fiftyone.core.fields.DateTimeField
    last_modified_at:          fiftyone.core.fields.DateTimeField
    right_hand:                fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Keypoints)
    body:                      fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Keypoints)
    left_hand:                 fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Keypoints)
    rcnn_keypoints_detections: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.la

You can also obtain [aggregate information](https://beta-docs.voxel51.com/fiftyone_concepts/using_aggregations/) about the metrics, for example the [upper and lower bounds](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#bounds) of the IoU values:

In [44]:
print(keypoint_dataset.bounds("vitpose_estimates.keypoints.ul_vs_vitpose_iou"))

(0.5010650611528887, 0.9999724480963939)


Or the [count](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#count_values) of true positives and false positives among the predictions:

In [54]:
print(keypoint_dataset.count_values("vitpose_estimates.keypoints.ul_vs_vitpose"))

{'tp': 1437, 'fp': 1372}
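
These counts tie back to the report printed earlier. As a rough sketch, the summary metrics can be recomputed from the true positive and false positive counts above plus the reported support. The `detection_metrics` helper is hypothetical, and the code assumes that `support` is the number of ground truth objects, so that the false negative count is `support - tp`.

```python
def detection_metrics(tp, fp, fn):
    """Compute summary metrics from matched/unmatched object counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fscore = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # With no true negatives for detection tasks, accuracy reduces to this ratio
    accuracy = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }


tp, fp = 1437, 1372  # from the count_values output above
support = 1477       # from the print_metrics output above
fn = support - tp    # assumption: support counts ground truth objects

metrics = detection_metrics(tp, fp, fn)
for name, value in metrics.items():
    print(f"{name:<10} {value:.2f}")
```

Under that assumption the results round to the values shown by `print_metrics()`: accuracy 0.50, precision 0.51, recall 0.97, and fscore 0.67. The low precision paired with high recall reflects the large number of false positives relative to false negatives.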


# Summary

In this tutorial, we've explored several approaches to working with keypoint estimation in FiftyOne:

- Used the Model Zoo's Keypoint R-CNN for out-of-the-box pose estimation
- Integrated Ultralytics' YOLO pose estimation model
- Leveraged community plugins with ViTPose
- Implemented a custom keypoint detection model (SuperPoint)
- Evaluated keypoint detection performance using FiftyOne's evaluation tools

## Key Takeaways
- FiftyOne provides multiple paths for keypoint detection, from pre-built solutions to custom implementations
- The framework's flexible data structures can handle both structured pose keypoints and unstructured interest points
- Built-in evaluation tools make it easy to compare and assess different keypoint detection approaches

## Next Steps
To build upon what you've learned:

- Explore other keypoint detection models in the [Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/models/)

- Try implementing your own custom keypoint detection models

- Check out the [FiftyOne plugins repository](https://github.com/voxel51/fiftyone-plugins) for more community-contributed tools

- Learn more about [evaluating detections](https://beta-docs.voxel51.com/tutorials/evaluate_detections/) in FiftyOne

- Read [this blog post](https://voxel51.com/blog/cotracker3-a-point-tracker-using-real-videos/) about using CoTracker3 with FiftyOne

For questions or to share your implementations, join the [FiftyOne Discord community](https://community.voxel51.com/) and [follow us on LinkedIn](https://www.linkedin.com/company/voxel51/posts/?feedView=all).