Talking Points: Describing and Localizing Pixels
Matan Rusanovsky, Shimon Malnick, Shai Avidan
❓ Vision-language models excel at understanding objects and regions, but can they truly comprehend individual pixels? Can we describe a single keypoint in an image so precisely that the description alone allows us to locate that exact pixel?
↪️ Instead of relying on templated prompts or keypoint names, we introduce Talking Points, a framework that generates rich, free-form descriptions of individual pixels and localizes them back with high precision. We evaluate our descriptions not by comparing text, but by testing whether they can accurately guide localization.
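The evaluation idea is a round trip: describe a ground-truth pixel, localize from the description alone, and check how close the predicted pixel lands. As a minimal illustration (the threshold, normalization, and function name here are assumptions, not the repo's actual metric), a PCK-style check might look like:

```python
import math

def pck_at_alpha(pred, gt, bbox_size, alpha=0.1):
    """Return True if the predicted point lies within alpha * bbox_size
    of the ground-truth point (a standard PCK-style criterion)."""
    dist = math.dist(pred, gt)  # Euclidean distance in pixels
    return dist <= alpha * bbox_size

# Round trip: the description of ground-truth pixel (120, 84) leads the
# localizer to predict (126, 88), inside an object box of size 100 px.
print(pck_at_alpha((126, 88), (120, 84), bbox_size=100))  # -> True
```

A text-similarity score would reward fluent but imprecise descriptions; a localization check like this only rewards descriptions that actually pin down the pixel.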
- Download Pascal-Part-116 & ADE20K-Part-234 and PartImageNet(_Seg). Place these datasets in the `./datasets` folder.
- Download the OMG-LLaVA models, and place them in the `models/OMG-LLaVA` folder.
- Download the LLaVA-v1.6-34b model, and place it in the `models/llava-v1.6-34b` folder.
- Install ollama and run:

  ```shell
  ollama serve &
  ollama pull llama3.3
  ```

- Download the AP-10K dataset and place it in the `./datasets` folder.
See INSTALL.md for detailed installation instructions.
You can download our Point Descriptor and Localizer models from Hugging Face, and place them in the `./models/TalkingPoints` folder.
The annotations of our generated dataset are in `./LlamaPointInPart`. The images can be downloaded from Pascal-Part-116 & ADE20K-Part-234 and from PartImageNet(_Seg).
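The annotation files are JSON; the exact schema is determined by the generation script. As a hedged sketch only (the keys `image`, `point`, and `description` below are hypothetical and may not match `./LlamaPointInPart/train.json`), reading one record could look like:

```python
import json

# Hypothetical annotation record -- the real keys in train.json may differ.
record = json.loads("""
{
    "image": "pascal_part/2008_000034.jpg",
    "point": [211, 147],
    "description": "the tip of the dog's left ear"
}
""")

x, y = record["point"]
print(f"({x}, {y}): {record['description']}")
```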
Dataset creation:

```shell
OMG_LLAVA_CONFIG=./configs/omg_llava_dataset_creation.py
OMG_LLAVA_MODEL=./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth
LLAVA_MODEL=./models/llava-v1.6-34b
DATA_ROOT=./datasets/
OUT_DIR=./LlamaPointInPart
NUM_SAMPLES=7257

ollama run --keepalive 1s llama3.3
python src/generate_point_descriptions.py $OMG_LLAVA_CONFIG $OMG_LLAVA_MODEL $LLAVA_MODEL $DATA_ROOT $OUT_DIR $NUM_SAMPLES
python src/split_to_train_and_test.py
```

Train the Point Descriptor:

```shell
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
CHECKPOINT_DIR="./work_dirs/point_descriptor"

# Find the latest iteration checkpoint directory
LATEST_CKPT=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_*.pth" \
    | grep -oE 'iter_[0-9]+' \
    | grep -oE '[0-9]+' \
    | sort -n \
    | tail -n 1)

# Build resume flag if checkpoint exists
RESUME_FLAG=""
if [ -n "$LATEST_CKPT" ]; then
    RESUME_PATH=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_${LATEST_CKPT}.pth" | head -n 1)
    RESUME_FLAG="--resume $RESUME_PATH"
fi

PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints.py --deepspeed deepspeed_zero2 $RESUME_FLAG
```

Tune the Point Localizer:

```shell
python src/localize.py \
    --omg_llava_config ./configs/omg_llava_7b_point_localizer.py \
    --omg_llava_model ./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth \
    --dataset_root_dir ./datasets \
    --eval_file_name ./work_dirs/localization/point_localizer_evaluation_log.json \
    --train_annotations ./LlamaPointInPart/train.json \
    --test_annotations ./LlamaPointInPart/test.json \
    --models_dir ./work_dirs/localization/models \
    --batch_size 8 \
    --total_epochs 200 \
    --state tune
```

RL training on AP-10K:

```shell
CHECKPOINT_DIR="./work_dirs/ap10k_rl_bovidae"
# or: CHECKPOINT_DIR="./work_dirs/ap10k_rl_canidae"

# Find the latest iteration checkpoint directory
LATEST_CKPT=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_*.pth" \
    | grep -oE 'iter_[0-9]+' \
    | grep -oE '[0-9]+' \
    | sort -n \
    | tail -n 1)

# Build resume flag if checkpoint exists
RESUME_FLAG=""
if [ -n "$LATEST_CKPT" ]; then
    RESUME_PATH=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_${LATEST_CKPT}.pth" | head -n 1)
    RESUME_FLAG="--resume $RESUME_PATH"
fi

NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Bovidae.py --deepspeed deepspeed_zero2 $RESUME_FLAG
```
```shell
# or: PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Canidae.py --deepspeed deepspeed_zero2 $RESUME_FLAG
```

Evaluation:

```shell
python point_to_point.py \
    ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Bovidae_test.py \
    ./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth \
    ./models/TalkingPoints/PointLocalizer.pth \
    --log-file ./work_dirs/ap10k_evals/evaluation_log_omg_llava_tested_on_Canidae.json \
    --keypoints-info ./work_dirs/ap10k_evals/keypoints_info_omg_llava_tested_on_Canidae.json \
    --plots-out-dir ./work_dirs/ap10k_evals/plots_omg_llava_tested_on_Canidae/ \
    --des-batch-size 16 \
    --loc-batch-size 16
```

```bibtex
@article{rusanovsky2025talking,
  title={Talking Points: Describing and Localizing Pixels},
  author={Rusanovsky, Matan and Malnick, Shimon and Avidan, Shai},
  journal={arXiv preprint arXiv:2510.14583},
  year={2025}
}
```

This repository is built upon and incorporates code from OMG-Seg and OMG-LLaVA. In addition, it uses code from LLaVA.
This project is released under the Apache-2.0 license, in keeping with both the LLaVA and XTuner codebases.
