Talking Points: Describing and Localizing Pixels
Matan Rusanovsky, Shimon Malnick, Shai Avidan
❓ Vision-language models excel at understanding objects and regions, but can they truly comprehend individual pixels? Can we describe a single keypoint in an image so precisely that the description alone allows us to locate that exact pixel?
↪️ Instead of relying on templated prompts or keypoint names, we introduce Talking Points, a framework that generates rich, free-form descriptions of individual pixels and localizes them back with high precision. We evaluate our descriptions not by comparing text, but by testing whether they can accurately guide localization.
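The evaluation idea is a round trip: describe a ground-truth pixel, localize from the description alone, and check how close the predicted pixel lands. As a minimal illustration (the threshold, normalization, and function name here are assumptions, not the repo's actual metric), a PCK-style check might look like:

```python
import math

def pck_at_alpha(pred, gt, bbox_size, alpha=0.1):
    """Return True if the predicted point lies within alpha * bbox_size
    of the ground-truth point (a standard PCK-style criterion)."""
    dist = math.dist(pred, gt)  # Euclidean distance in pixels
    return dist <= alpha * bbox_size

# Round trip: the description of ground-truth pixel (120, 84) leads the
# localizer to predict (126, 88), inside an object box of size 100 px.
print(pck_at_alpha((126, 88), (120, 84), bbox_size=100))  # -> True
```

A text-similarity score would reward fluent but imprecise descriptions; a localization check like this only rewards descriptions that actually pin down the pixel.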
- Download Pascal-Part-116 & ADE20K-Part-234 and PartImageNet(_Seg). Place these datasets in the `./datasets` folder.
- Download the OMG-LLaVA models, and place them in the `models/OMG-LLaVA` folder.
- Download the LLaVA-v1.6-34b model, and place it in the `models/llava-v1.6-34b` folder.
- Install ollama and run:

  ```shell
  ollama serve &
  ollama pull llama3.3
  ```

- Download the AP-10K dataset and place it in the `./datasets` folder.
See INSTALL.md for detailed installation instructions.
You can download our Point Descriptor and Localizer models from Hugging Face, and place them in the `./models/TalkingPoints` folder.
The annotations of our generated dataset are in `./LlamaPointInPart`. The images can be downloaded from Pascal-Part-116 & ADE20K-Part-234 and from PartImageNet(_Seg).
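The annotation files are JSON; the exact schema is determined by the generation script. As a hedged sketch only (the keys `image`, `point`, and `description` below are hypothetical and may not match `./LlamaPointInPart/train.json`), reading one record could look like:

```python
import json

# Hypothetical annotation record -- the real keys in train.json may differ.
record = json.loads("""
{
    "image": "pascal_part/2008_000034.jpg",
    "point": [211, 147],
    "description": "the tip of the dog's left ear"
}
""")

x, y = record["point"]
print(f"({x}, {y}): {record['description']}")
```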
Dataset creation:

```shell
OMG_LLAVA_CONFIG=./configs/omg_llava_dataset_creation.py
OMG_LLAVA_MODEL=./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth
LLAVA_MODEL=./models/llava-v1.6-34b
DATA_ROOT=./datasets/
OUT_DIR=./LlamaPointInPart
NUM_SAMPLES=7257

ollama run --keepalive 1s llama3.3
python src/generate_point_descriptions.py $OMG_LLAVA_CONFIG $OMG_LLAVA_MODEL $LLAVA_MODEL $DATA_ROOT $OUT_DIR $NUM_SAMPLES
python src/split_to_train_and_test.py
```

Train the Point Descriptor:

```shell
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
CHECKPOINT_DIR="./work_dirs/point_descriptor"

# Find the latest iteration checkpoint directory
LATEST_CKPT=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_*.pth" \
    | grep -oE 'iter_[0-9]+' \
    | grep -oE '[0-9]+' \
    | sort -n \
    | tail -n 1)

# Build resume flag if checkpoint exists
RESUME_FLAG=""
if [ -n "$LATEST_CKPT" ]; then
    RESUME_PATH=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_${LATEST_CKPT}.pth" | head -n 1)
    RESUME_FLAG="--resume $RESUME_PATH"
fi

PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints.py --deepspeed deepspeed_zero2 $RESUME_FLAG
```

Tune the Point Localizer:

```shell
python src/localize.py \
    --omg_llava_config ./configs/omg_llava_7b_point_localizer.py \
    --omg_llava_model ./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth \
    --dataset_root_dir ./datasets \
    --eval_file_name ./work_dirs/localization/point_localizer_evaluation_log.json \
    --train_annotations ./LlamaPointInPart/train.json \
    --test_annotations ./LlamaPointInPart/test.json \
    --models_dir ./work_dirs/localization/models \
    --batch_size 8 \
    --total_epochs 200 \
    --state tune
```

RL training on AP-10K:

```shell
CHECKPOINT_DIR="./work_dirs/ap10k_rl_bovidae"
# or: CHECKPOINT_DIR="./work_dirs/ap10k_rl_canidae"

# Find the latest iteration checkpoint directory
LATEST_CKPT=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_*.pth" \
    | grep -oE 'iter_[0-9]+' \
    | grep -oE '[0-9]+' \
    | sort -n \
    | tail -n 1)

# Build resume flag if checkpoint exists
RESUME_FLAG=""
if [ -n "$LATEST_CKPT" ]; then
    RESUME_PATH=$(find "$CHECKPOINT_DIR" -maxdepth 1 -type d -name "iter_${LATEST_CKPT}.pth" | head -n 1)
    RESUME_FLAG="--resume $RESUME_PATH"
fi

NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Bovidae.py --deepspeed deepspeed_zero2 $RESUME_FLAG
```
```shell
# or: PYTHONPATH=. NPROC_PER_NODE=$NUM_GPUS xtuner train ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Canidae.py --deepspeed deepspeed_zero2 $RESUME_FLAG
```

Evaluation:

```shell
python point_to_point.py \
    ./configs/omg_llava_7b_talking_keypoints_with_localizer_ap10k_Bovidae_test.py \
    ./models/OMG-LLaVA/omg_llava_7b_finetune_8gpus.pth \
    ./models/TalkingPoints/PointLocalizer.pth \
    --log-file ./work_dirs/ap10k_evals/evaluation_log_omg_llava_tested_on_Canidae.json \
    --keypoints-info ./work_dirs/ap10k_evals/keypoints_info_omg_llava_tested_on_Canidae.json \
    --plots-out-dir ./work_dirs/ap10k_evals/plots_omg_llava_tested_on_Canidae/ \
    --des-batch-size 16 \
    --loc-batch-size 16
```

```bibtex
@article{rusanovsky2025talking,
  title={Talking Points: Describing and Localizing Pixels},
  author={Rusanovsky, Matan and Malnick, Shimon and Avidan, Shai},
  journal={arXiv preprint arXiv:2510.14583},
  year={2025}
}
```

This repository is built upon and incorporates code from OMG-Seg and OMG-LLaVA. In addition, it uses code from LLaVA.
This project is released under the Apache-2.0 license, in keeping with both the LLaVA and XTuner codebases.
