
Talking Points: Describing and Localizing Pixels

ICLR 2026

Matan Rusanovsky, Shimon Malnick, Shai Avidan

❓ Vision-language models excel at understanding objects and regions, but can they truly comprehend individual pixels? Can we describe a single keypoint in an image so precisely that the description alone allows us to locate that exact pixel?

↪️ Instead of relying on templated prompts or keypoint names, we introduce Talking Points - a framework that generates rich, free-form descriptions of individual pixels and localizes them back with high precision. We evaluate our descriptions not by comparing text, but by testing whether they can accurately guide localization.


Getting Started

Requirements:

Dataset Creation and Point Descriptor/Localizer Training:

  1. Download Pascal-Part-116 & ADE20K-Part-234 and PartImageNet(_Seg). Place these datasets in the ./datasets folder (a possible folder layout is sketched after these lists).
  2. Download the OMG-LLaVA models, and place them in the models/OMG-LLaVA folder.

Relevant only for Dataset Creation:

  1. Download the LLaVA-v1.6-34b model and place it in the models/llava-v1.6-34b folder.
  2. Install ollama, then start the server and pull the model:
ollama serve &
ollama pull llama3.3
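
To verify that the server is reachable and that llama3.3 was pulled, you can list the locally available models:

ollama list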

Relevant only for Point Descriptor RL Training:

  1. Download the AP-10K dataset and place it in the ./datasets folder.
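
After these steps, the working tree should look roughly like the sketch below. The exact subfolder names inside ./datasets depend on how each dataset archive extracts, so treat this as a rough guide rather than a required layout:

./datasets/               # Pascal-Part-116, ADE20K-Part-234, PartImageNet(_Seg), AP-10K
./models/OMG-LLaVA/       # OMG-LLaVA checkpoints, e.g. omg_llava_7b_finetune_8gpus.pth
./models/llava-v1.6-34b/  # LLaVA-v1.6-34b (needed for dataset creation only)
./models/TalkingPoints/   # pretrained Point Descriptor / Localizer weights (see below)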

Environment setup

See INSTALL.md for detailed installation instructions.

Pretrained weights

You can download our Point Descriptor and Localizer models from Hugging Face and place them in the ./models/TalkingPoints folder.
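
For example, using the huggingface-cli tool that ships with huggingface_hub (the repository ID below is a placeholder; substitute the actual model repo linked above):

huggingface-cli download <hf-org>/TalkingPoints --local-dir ./models/TalkingPoints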

LlamaPointInPart Dataset

The annotations of our generated dataset are in ./LlamaPointInPart. The images can be downloaded from Pascal-Part-116 & ADE20K-Part-234 and from PartImageNet(_Seg).
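
As a quick sanity check (assuming the annotation files are plain JSON, as their extension suggests), you can count the entries in each split:

python -c "import json; print(len(json.load(open('./LlamaPointInPart/train.json'))))"
python -c "import json; print(len(json.load(open('./LlamaPointInPart/test.json'))))"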

Running

1. Dataset Creation

OMG_LLAVA_CONFIG=./configs/omg_llava_dataset_creation.py
OMG_LLAVA_MODEL=./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth
LLAVA_MODEL=./models/llava-v1.6-34b
DATA_ROOT=./datasets/
OUT_DIR=./LlamaPointInPart
NUM_SAMPLES=7257

ollama run --keepalive 1s llama3.3
python src/generate_point_descriptions.py $OMG_LLAVA_CONFIG $OMG_LLAVA_MODEL $LLAVA_MODEL $DATA_ROOT $OUT_DIR $NUM_SAMPLES
python src/split_to_train_and_test.py
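
If the run completes, the split script should leave the annotation files consumed by the later steps (a quick check; the file names follow the paths used in step 3):

ls -lh ./LlamaPointInPart/train.json ./LlamaPointInPart/test.json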

2. Point Descriptor Training

NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
CHECKPOINT_DIR="./work_dirs/point_descriptor"

# Find the latest iteration checkpoint directory
LATEST_CKPT=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_*.pth" \
  | grep -oE 'iter_[0-9]+' \
  | grep -oE '[0-9]+' \
  | sort -n \
  | tail -n 1)

# Build resume flag if checkpoint exists
RESUME_FLAG=""
if [ -n "$LATEST_CKPT" ]; then
    RESUME_PATH=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_${LATEST_CKPT}.pth" | head -n 1)
    RESUME_FLAG="--resume $RESUME_PATH"
fi

PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints.py --deepspeed deepspeed_zero2 $RESUME_FLAG

3. Point Localizer Training

python src/localize.py \
  --omg_llava_config ./configs/omg_llava_7b_point_localizer.py \
  --omg_llava_model ./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth \
  --dataset_root_dir ./datasets \
  --eval_file_name ./work_dirs/localization/point_localizer_evaluation_log.json \
  --train_annotations ./LlamaPointInPart/train.json \
  --test_annotations ./LlamaPointInPart/test.json \
  --models_dir ./work_dirs/localization/models \
  --batch_size 8 \
  --total_epochs 200 \
  --state tune

4. Point Descriptor RL Training, using the Point Localizer as a reward model

Train on Bovidae or Canidae from AP-10K

CHECKPOINT_DIR="./work_dirs/ap10k_rl_bovidae"
# or: CHECKPOINT_DIR="./work_dirs/ap10k_rl_canidae"

# Find the latest iteration checkpoint directory
LATEST_CKPT=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_*.pth" \
  | grep -oE 'iter_[0-9]+' \
  | grep -oE '[0-9]+' \
  | sort -n \
  | tail -n 1)

# Build resume flag if checkpoint exists
RESUME_FLAG=""
if [ -n "$LATEST_CKPT" ]; then
    RESUME_PATH=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_${LATEST_CKPT}.pth" | head -n 1)
    RESUME_FLAG="--resume $RESUME_PATH"
fi

NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Bovidae.py --deepspeed deepspeed_zero2 $RESUME_FLAG
# or: PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Canidae.py --deepspeed deepspeed_zero2 $RESUME_FLAG

5. Evaluation - Point Descriptor & Localizer

Test the model (for example, the one that was trained on Bovidae from AP-10K)

python point_to_point.py \
  ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Bovidae_test.py \
  ./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth \
  ./models/TalkingPoints/PointLocalizer.pth \
  --log-file ./work_dirs/ap10k_evals/evaluation_log_omg_llava_tested_on_Canidae.json \
  --keypoints-info ./work_dirs/ap10k_evals/keypoints_info_omg_llava_tested_on_Canidae.json \
  --plots-out-dir ./work_dirs/ap10k_evals/plots_omg_llava_tested_on_Canidae/ \
  --des-batch-size 16 \
  --loc-batch-size 16
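
The evaluation log is written as JSON, so it can be pretty-printed for a quick look (a generic inspection command, not part of this repo):

python -m json.tool ./work_dirs/ap10k_evals/evaluation_log_omg_llava_tested_on_Canidae.json | head -n 40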

BibTeX

@article{rusanovsky2025talking,
  title={Talking Points: Describing and Localizing Pixels},
  author={Rusanovsky, Matan and Malnick, Shimon and Avidan, Shai},
  journal={arXiv preprint arXiv:2510.14583},
  year={2025}
}

Acknowledgments

This repository is built upon and incorporates code from OMG-Seg and OMG-LLaVA. In addition, it uses code from LLaVA.

License

This project is released under the Apache-2.0 license, in keeping with the licenses of both the LLaVA and XTuner codebases.
