# SynthHuman Dataset Analysis in FiftyOne

This notebook demonstrates how to load and analyze the SynthHuman dataset using FiftyOne, including:
- Loading the dataset from Hugging Face
- Computing embeddings and visualizations
- Performing similarity analysis
- Enriching data with AI-generated labels using Qwen2.5VL

## Setup and Installation

First, we'll install the required packages:


In [None]:
!pip install fiftyone hf-xet einops qwen_vl_utils umap-learn

## Loading the SynthHuman Dataset

Load the SynthHuman dataset from Hugging Face Hub into FiftyOne format.

Note: If you get an error from Hugging Face (e.g., a 429 or 5xx status), rerun the cell; it will pick up from where it left off.

Alternatively, pass `max_samples=<some number>` to `load_from_hub` to download only a subset of the dataset.

Note: there is currently an issue with downloading the 3D assets from Hugging Face, which we're working on. In the meantime, you can follow the instructions to download and render the 3D assets locally.

In [None]:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub(
    "Voxel51/SynthHuman",
    name="SynthHuman",
    overwrite=True,
)
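
Instead of rerunning the cell by hand after a transient Hugging Face error, you can wrap the call in a retry loop. The sketch below is a generic, hypothetical helper (`retry_with_backoff` is not part of FiftyOne) that retries with exponential backoff. By default it retries on any exception, so in practice you may want to narrow `retriable` to the HTTP error type you actually see.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retriable=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds, sleeping with exponential backoff between failures.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) capped at max_delay, plus up to 10% jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

For example: `dataset = retry_with_backoff(lambda: fouh.load_from_hub("Voxel51/SynthHuman", name="SynthHuman", overwrite=True))`.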

We'll use only the RGB images (the `rgb` group slice) for this analysis.

In [None]:
synth_human_dataset = dataset.select_group_slices("rgb")

## Computing Embeddings and Visualizations

This cell performs several key operations:

1. **Registers and loads the C-RADIO v3 model** - An agglomerative vision foundation model from NVIDIA Research (NVlabs), distilled from multiple teacher models

2. **Computes embeddings** - Generates 3048-dimensional feature vectors for each image

3. **Creates a UMAP visualization** - Reduces the high-dimensional embeddings to 2D so you can explore them in the FiftyOne App's Embeddings panel


In [None]:
import fiftyone.brain as fob
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/NVLabs_CRADIOV3",
)

# output_type="summary" returns a 1D embedding vector per image (3048 dimensions)
embedding_model = foz.load_zoo_model(
    "nv_labs/c-radio_v3-h",
    output_type="summary",
    # "NCHW": [Batch, Channels, Height, Width]; alternatively "NLC": [Batch, Num_patches, Channels]
    feature_format="NCHW",
    install_requirements=True,
)

synth_human_dataset.compute_embeddings(
    model=embedding_model,
    embeddings_field="radio_embeddings"
)

# Create UMAP visualization
results = fob.compute_visualization(
    synth_human_dataset,
    method="umap",  # Also supports "tsne", "pca"
    brain_key="radio_viz",
    embeddings="radio_embeddings"
)
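
To build intuition for what "reducing to 2D" means, here is a minimal pure-Python sketch of PCA, one of the alternative methods the cell's comment mentions. It is not how FiftyOne computes UMAP (UMAP is a nonlinear method that preserves local neighborhoods), and the function name `pca_2d` is hypothetical. It finds the top two principal components by power iteration with deflation and projects each centered vector onto them.

```python
import math
import random


def pca_2d(vectors, iters=200, seed=0):
    """Project equal-length vectors onto their top-2 principal components."""
    n, d = len(vectors), len(vectors[0])
    # Center the data
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    X = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    def cov_times(w):
        # Computes (X^T X) w without materializing the d x d covariance matrix
        proj = [sum(x[j] * w[j] for j in range(d)) for x in X]
        return [sum(proj[i] * X[i][j] for i in range(n)) for j in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = cov_times(w)
            # Deflation: remove directions already captured by earlier components
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w)) or 1.0
            w = [a / norm for a in w]
        components.append(w)

    # 2D coordinates: projection of each centered vector onto the two components
    return [[sum(x[j] * c[j] for j in range(d)) for c in components] for x in X]
```

Real embeddings have thousands of dimensions, so beyond toy data you would use an optimized library rather than this loop-based sketch.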



## Computing Similarity Index

Build a similarity index using the computed embeddings. This enables fast similarity searches to find images that are visually similar to any given sample:


In [None]:
results = fob.compute_similarity(
    synth_human_dataset,
    backend="sklearn",  # in-memory scikit-learn backend; no external vector database required
    brain_key="radio_sim",
    embeddings="radio_embeddings"
)
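
Conceptually, a brute-force index like the sklearn backend compares a query embedding with every stored embedding and returns the closest ones. The sketch below assumes cosine similarity as the metric; `top_k_similar` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_similar(query, embeddings, k=3):
    """Brute-force search: return (id, score) pairs for the k most similar embeddings."""
    scored = [(sid, cosine_similarity(query, emb)) for sid, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2D "embeddings"; real ones come from the radio_embeddings field
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
results = top_k_similar([1.0, 0.0], embeddings, k=2)
```

In the notebook itself, you would query the index through FiftyOne, e.g. `synth_human_dataset.sort_by_similarity(sample_id, k=10, brain_key="radio_sim")`, where `sample_id` is the ID of a query sample.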


## Computing Representativeness Scores

Calculate how "representative" each sample is of the overall dataset. This helps identify:

- **Most representative samples** - Images that are most typical of the dataset, i.e., closest to the centers of clusters in embedding space

- **Outliers** - Unusual or unique samples that differ from the norm


In [None]:
# Compute representativeness scores
fob.compute_representativeness(
    synth_human_dataset,
    representativeness_field="radio_represent",
    method="cluster-center",
    embeddings="radio_embeddings"
)

# Find most representative samples
representative_view = synth_human_dataset.sort_by("radio_represent", reverse=True)
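
The idea behind the `cluster-center` method can be sketched as: cluster the embeddings, then score each sample by how close it sits to its nearest cluster center. This is a simplified stand-in that uses plain k-means with a deterministic initialization and a linear distance-to-score mapping; FiftyOne's own clustering and normalization may differ, and both function names are hypothetical.

```python
import math


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means, initialized deterministically with the first k points."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def representativeness(points, k=2):
    """Score in [0, 1]: higher means closer to a cluster center; 0.0 is the farthest point."""
    centers = kmeans(points, k)
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    max_d = max(dists) or 1.0
    return [1.0 - d / max_d for d in dists]


# Two tight clusters plus one point in between (the outlier)
points = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1], [5, 5]]
scores = representativeness(points, k=2)
```

Sorting ascending instead, `synth_human_dataset.sort_by("radio_represent")`, surfaces the least representative samples, which are the outlier candidates.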

## Detecting Duplicates and Near-Duplicates

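Use embeddings to compute a uniqueness score for each sample. Samples with very low uniqueness have close neighbors in embedding space, which makes them candidates for duplicates or near-duplicates.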

This is crucial for:
- Data cleaning and deduplication

- Understanding dataset quality

- Reducing redundancy in training data


In [None]:
# Score how unique each sample is; low scores flag likely (near-)duplicates
results = fob.compute_uniqueness(
    synth_human_dataset,
    embeddings="radio_embeddings"
)
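
Two related embedding-based checks can be sketched in plain Python: flagging pairs whose cosine similarity exceeds a threshold (near-duplicates), and scoring each sample by its distance to its nearest neighbors (uniqueness). Both functions are hypothetical illustrations; the 0.98 threshold is an arbitrary assumption to tune on your data, and the uniqueness score is a simplified k-nearest-neighbor measure in the same spirit as, but not identical to, `compute_uniqueness`. The pairwise loop is O(n²), which is fine for intuition but not for a full dataset.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicate_pairs(embeddings, threshold=0.98):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(embeddings.items(), 2):
        sim = cosine(a, b)
        if sim >= threshold:
            pairs.append((id_a, id_b, sim))
    return pairs


def knn_uniqueness(embeddings, k=1):
    """Mean cosine distance to the k nearest neighbors; higher means more unique."""
    scores = {}
    for sid, emb in embeddings.items():
        dists = sorted(1.0 - cosine(emb, other)
                       for oid, other in embeddings.items() if oid != sid)
        scores[sid] = sum(dists[:k]) / k
    return scores


embeddings = {"a": [1.0, 0.0], "a_copy": [1.0, 0.01], "b": [0.0, 1.0]}
pairs = near_duplicate_pairs(embeddings)
uniqueness = knn_uniqueness(embeddings)
```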

## Generating Spatial Heatmaps

Create heatmaps from the C-RADIO model's spatial (patch-level) features to visualize which regions of each image the model responds to. These heatmaps help you understand:

- What visual features the model considers important

- How the model "sees" different parts of human images


In [None]:

# returns spatial features which are parsed as a FiftyOne Heatmap

spatial_model = foz.load_zoo_model(
    "nv_labs/c-radio_v3-h",
    output_type="spatial",
    apply_smoothing=True, # or False
    smoothing_sigma=0.51, # used only when apply_smoothing=True
    feature_format="NCHW"
)

synth_human_dataset.apply_model(spatial_model, "radio_heatmap")
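
One simple way to turn patch-level features into a heatmap is sketched below: take the L2 norm of each patch's feature vector, arrange the norms on the patch grid, optionally apply a Gaussian blur, and min-max normalize to [0, 1]. The blur is loosely analogous to the `apply_smoothing`/`smoothing_sigma` options above, which we assume apply a Gaussian filter. This is an assumption about the general technique, not the zoo model's actual implementation, and all function names are hypothetical. The sketch takes one image's features in `NLC` layout ([Num_patches, Channels], row-major over an H x W patch grid); an `NCHW` tensor holds the same values arranged as [Channels, Height, Width].

```python
import math


def gaussian_kernel(sigma):
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(row, kernel):
    r = len(kernel) // 2
    n = len(row)
    # Clamp indices at the borders ("edge" padding)
    return [sum(kernel[j] * row[min(max(i + j - r, 0), n - 1)] for j in range(len(kernel)))
            for i in range(n)]


def spatial_features_to_heatmap(features_nlc, height, width, sigma=None):
    """features_nlc: one image's [num_patches][channels] features, row-major over H x W."""
    assert len(features_nlc) == height * width, "num_patches must equal height * width"
    # Per-patch activation strength: L2 norm over channels
    norms = [math.sqrt(sum(c * c for c in patch)) for patch in features_nlc]
    grid = [norms[r * width:(r + 1) * width] for r in range(height)]
    if sigma:
        kernel = gaussian_kernel(sigma)
        grid = [blur_1d(row, kernel) for row in grid]              # blur along rows
        cols = [blur_1d(list(col), kernel) for col in zip(*grid)]  # blur along columns
        grid = [list(row) for row in zip(*cols)]
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in grid]


# 2 x 2 patch grid with 2 channels per patch
features = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
heatmap = spatial_features_to_heatmap(features, height=2, width=2)
smoothed = spatial_features_to_heatmap(features, height=2, width=2, sigma=0.5)
```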

# Enrich data using Qwen2.5VL

## Setting up Qwen2.5VL Model

Register and download the Qwen2.5-VL-3B-Instruct model, a powerful vision-language model that can:

- Classify images based on text prompts

- Answer questions about image content

- Generate detailed descriptions


In [None]:
# Register the model source
foz.register_zoo_model_source("https://github.com/harpreetsahota204/qwen2_5_vl")

# Download the model
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/qwen2_5_vl",
    model_name="Qwen/Qwen2.5-VL-3B-Instruct"
)

qwen_model = foz.load_zoo_model(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    install_requirements=True,  # installs the model's dependencies the first time you load it
)

## Classifying Body Poses

Use Qwen2.5VL to classify each image into pose categories:

- **Face**: Close-up facial shots

- **Full Body**: Complete body visible

- **Upper Body**: Torso and above visible


In [None]:
qwen_model.operation = "classify"

qwen_model.prompt = "Classify this image into exactly one of the following types: Face, Full Body, Upper Body"

synth_human_dataset.apply_model(qwen_model, label_field="pose")
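
Vision-language models do not always answer with the exact label text: you may get `"full body."` or a full sentence. A small, hypothetical post-processing helper like the one below maps a free-text answer onto one of the allowed classes and returns a fallback when the answer is ambiguous. It assumes you have access to the raw answer strings (for example, the `label` values of the `pose` field, assuming it stores `Classification` labels).

```python
import re

POSE_LABELS = ["Face", "Full Body", "Upper Body"]


def canon(text):
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def normalize_label(raw, allowed, fallback=None):
    """Map a free-text model answer onto exactly one allowed label, else return fallback."""
    text = canon(raw)
    # 1) Exact match after normalization
    for label in allowed:
        if text == canon(label):
            return label
    # 2) The answer mentions exactly one allowed label as whole words
    hits = [label for label in allowed
            if re.search(rf"\b{re.escape(canon(label))}\b", text)]
    return hits[0] if len(hits) == 1 else fallback
```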

## Analyzing Hair Color

Extract hair color information from each image using natural language prompts. This adds appearance metadata to the dataset.


In [None]:
qwen_model.prompt = "What is the hair color of this person?"

synth_human_dataset.apply_model(qwen_model, label_field="hair_color")

## Identifying Racial/Ethnic Characteristics

Classify the perceived race/ethnicity of individuals in the images. This demographic information can be useful for:
- Analyzing dataset diversity and representation
- Ensuring balanced training data for fair AI models


In [None]:
qwen_model.prompt = "What is the race of this person? Provide only one response."

synth_human_dataset.apply_model(qwen_model, label_field="race")
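
To use these labels for analyzing diversity and representation, tally how often each value occurs. The hypothetical helpers below compute per-label counts and shares, plus a simple imbalance ratio (largest class count divided by smallest). In FiftyOne, `synth_human_dataset.count_values("race.label")` gives the same tallies, assuming the field stores `Classification` labels.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} ordered by count, ignoring missing (None) labels."""
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


def imbalance_ratio(labels):
    """Largest class count divided by smallest; 1.0 means perfectly balanced."""
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    return max(counts.values()) / min(counts.values())


labels = ["A", "B", "A", None, "A", "B", "C"]
dist = label_distribution(labels)
ratio = imbalance_ratio(labels)
```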

## Detecting Facial Expressions

Classify facial expressions into seven basic emotion categories:

- Happiness, Sadness, Surprise, Fear, Anger, Disgust, Contempt

This adds emotional context to the dataset, useful for emotion recognition research and applications.


In [None]:
qwen_model.prompt = "Classify this image into exactly one of the following facial expressions: Happiness, Sadness, Surprise, Fear, Anger, Disgust, Contempt"

synth_human_dataset.apply_model(qwen_model, label_field="facial_expression")

## Generating Detailed Captions

Switch to Visual Question Answering (VQA) mode to generate rich, descriptive captions for each image. These captions include:

- Physical appearance details

- Facial expressions

- Hair styles

- Activities or poses

This creates comprehensive textual descriptions that can be used for search, filtering, and understanding dataset content.


In [None]:
qwen_model.operation = "vqa"

qwen_model.prompt = "Briefly describe the person in this image, their facial expression, hair style, and what they are doing"

synth_human_dataset.apply_model(qwen_model, label_field="caption")
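
As a sketch of how the captions support search, the hypothetical function below returns the samples whose caption contains every word of a query, ignoring case and punctuation. Matching whole words rather than substrings means a query for "man" does not match "woman". It assumes you have already pulled the captions into a plain `{sample_id: caption}` dictionary; real search would more likely use text embeddings or FiftyOne's view filtering.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_captions(captions, query):
    """Return ids whose caption contains every word of the query (case-insensitive)."""
    terms = tokenize(query)
    return [sid for sid, caption in captions.items() if terms <= tokenize(caption)]


captions = {
    "s1": "A woman with curly red hair is smiling.",
    "s2": "A man with short black hair looks surprised.",
    "s3": "A smiling man with red hair.",
}
red_hair = search_captions(captions, "red hair")
smiling_man = search_captions(captions, "smiling man")
```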
