Boyuan Sun<sup>1,2</sup>, Bowen Yin<sup>1,2</sup>, Yuanming Li<sup>2</sup>, Xihan Wei<sup>2</sup>, Qibin Hou<sup>1†</sup>
<sup>1</sup> VCIP, Nankai University; <sup>2</sup> Tongyi Lab, Alibaba Group; <sup>†</sup> Corresponding author
SWIM enables multimodal large language models to understand specific objects in videos at a fine-grained level. Given a video and a natural language reference to a target object, SWIM can accurately describe the object's appearance, actions, and temporal dynamics — while avoiding hallucination about irrelevant objects.
Core idea — Apply attention-level supervision during training so the model learns to attend to the correct visual regions when generating descriptions of a referred object.
| # | Component | Description |
|---|---|---|
| 1 | NL-Refer Dataset | A natural-language referring dataset built on top of VideoRefer-700K. Unlike the original which uses visual prompts (colored masks), NL-Refer replaces them with natural language descriptions, enabling a more practical and scalable referring paradigm. |
| 2 | Attention Supervision | During SFT, the model receives additional loss on attention maps to encourage correct grounding between <ins>...</ins>-tagged entity tokens and the corresponding visual regions. |
| 3 | Selective Fine-tuning | The vision encoder is frozen; only the language model is updated, keeping training efficient. |
git clone git@github.com:HumanMLLM/SWIM.git
cd SWIM
conda create -n swim python=3.10
conda activate swim
# Core dependencies
cd Q-R1
pip install -e .
pip install trl
# Modified transformers (required for attention supervision)
cd ../transformers
pip install -e .
pip install matplotlib huggingface_hub
Note: SWIM depends on a custom fork of HuggingFace Transformers shipped in transformers/. You must install it from this repo, not from PyPI.
mkdir model_zoo && cd model_zoo
# (Optional) HuggingFace mirror for faster download in China
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download BBBBCHAN/SWIM-7B --local-dir SWIM-7B
We introduce NL-Refer, a natural-language referring dataset built on top of VideoRefer-700K. Unlike the original dataset, which uses visual prompts (colored masks overlaid on frames) to indicate target objects, NL-Refer replaces them with natural language referring expressions, enabling a more practical paradigm where users simply describe the object in words.
The dataset is constructed by using GPT-4o to rewrite <objectx><region> placeholders into concise NL descriptions, with the core referring word tagged as <ins>...</ins> for attention supervision.
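As an illustration of the tagging convention, the sketch below pulls `<ins>...</ins>` spans out of a rewritten caption and records which mention of each entity it is. The function name `extract_ins_entities` and the sample caption are made up for this example; the repository's own extraction code may behave differently.

```python
import re
from collections import Counter

# Matches the text between <ins> and </ins>; non-greedy so adjacent tags stay separate.
INS_PATTERN = re.compile(r"<ins>(.*?)</ins>", re.DOTALL)


def extract_ins_entities(text):
    """Return (plain_text, entities), where entities is a list of
    (entity, occurrence_index) pairs in order of appearance.

    Hypothetical helper for illustration only; not the repository's code.
    """
    seen = Counter()
    entities = []
    for match in INS_PATTERN.finditer(text):
        entity = match.group(1).strip()
        entities.append((entity, seen[entity]))  # 0 for first mention, 1 for second, ...
        seen[entity] += 1
    # Drop the tags but keep the tagged words in the caption.
    plain_text = INS_PATTERN.sub(lambda m: m.group(1), text)
    return plain_text, entities


# Invented caption, used only to exercise the helper.
caption = "The <ins>dog</ins> chases a ball while the <ins>man</ins> watches the <ins>dog</ins>."
plain, ents = extract_ins_entities(caption)
print(plain)
print(ents)
```

Tracking the occurrence index lets repeated mentions of the same word be told apart, which matters when each mention has to be matched to token positions later on.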
Dataset and construction scripts: BBBBCHAN/NL-Refer
Dataset Structure
NL-Refer/
├── train/  # Training annotations
│   ├── refined-format-videorefer-detailed-caption-*.json  # NL-Refer-D (~125K, 4 shards)
│   ├── refined-format-videorefer-qa-0-10k.json  # NL-Refer-Q (~10K)
│   └── filtered_valid_llava_video_178k_*.json  # LLaVA-Video supplementary
├── bench/  # Evaluation benchmarks
│   ├── refined-VideoRefer-Bench-D.json  # Description generation (400 samples)
│   ├── refined-VideoRefer-Bench-Q.json  # Multiple-choice QA (1000 samples)
│   └── *-synonym.json  # Synonym-augmented variants
└── scripts/  # Dataset construction pipeline
    ├── construction/  # GPT-4o rewriting scripts
    └── llava_video/  # LLaVA-Video processing
Training annotation JSONs are hosted on BBBBCHAN/SWIM_data due to their size (~7GB total).
# Download benchmarks and construction scripts
huggingface-cli download --resume-download BBBBCHAN/NL-Refer --repo-type dataset --local-dir NL-Refer
SWIM-7B is fine-tuned from Qwen2.5-VL-7B-Instruct and shares the same inference API.
Quick Start
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model (flash_attention_2 recommended for speed and memory)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "BBBBCHAN/SWIM-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B")

# Optionally control visual token budget:
# processor = AutoProcessor.from_pretrained(
#     "BBBBCHAN/SWIM-7B", min_pixels=256*28*28, max_pixels=1280*28*28
# )

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
).to("cuda")

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Training uses DeepSpeed Zero-3 with 8 GPUs, BF16 mixed precision, and Flash Attention 2.
cd Q-R1/src/open-r1-multimodal
# Edit run_scripts/run_sft_videorefer_qwen25vl.sh to set:
# --model_name_or_path (path to Qwen2.5-VL-7B-Instruct)
# --image_root (path to your image/video data root)
bash run_scripts/run_sft_videorefer_qwen25vl.sh
Key Training Parameters
| Parameter | Value |
|---|---|
| Base model | Qwen2.5-VL-7B-Instruct |
| Batch size | 1 per GPU × 4 gradient accumulation steps |
| Learning rate | 2e-5 (cosine schedule) |
| Epochs | 1 |
| Precision | BF16 |
| Distribution | DeepSpeed Zero-3 offload |
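For reference, the effective global batch size implied by these settings can be worked out as below. The 8-GPU count comes from the training description; the variable names are illustrative only.

```python
# Values taken from the training description and the parameter table.
num_gpus = 8             # GPUs used for the DeepSpeed job
per_gpu_batch_size = 1   # "1 per GPU"
grad_accum_steps = 4     # gradient accumulation steps

# Number of samples contributing to each optimizer update.
effective_batch_size = num_gpus * per_gpu_batch_size * grad_accum_steps
print(effective_batch_size)  # 32
```

If you train on fewer GPUs, raising the accumulation steps proportionally keeps this number unchanged.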
Training data is configured in Q-R1/src/open-r1-multimodal/data_config/videorefer.yaml, combining ~125K NL-Refer samples with LLaVA-Video data.
cd Q-R1/src/open-r1-multimodal/run_scripts/eval/videorefer
# VideoRefer-Bench-Q (multiple-choice QA)
bash eval_videorefer-bench-q_qwen2_5vl.sh
# VideoRefer-Bench-D (description generation, requires GPT-4o API key)
bash eval_videorefer-bench-d_qwen2_5vl.sh
For benchmark data and format details, see the VideoRefer-Bench README.
General video understanding benchmarks (MVBench, VideoMME, ActivityNetQA, etc.) are evaluated via lmms_eval:
MODEL_NAME="SWIM-7B"
MODEL_PATH="BBBBCHAN/SWIM-7B"
accelerate launch --num_processes 8 --main_process_port 23553 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=$MODEL_PATH,use_flash_attention_2=true \
    --tasks mvbench \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix eval \
    --output_path ./logs/$MODEL_NAME
Replace --tasks mvbench with videomme, activitynetqa, etc. for other benchmarks.
SWIM/
├── Q-R1/src/open-r1-multimodal/
│   ├── src/open_r1/
│   │   ├── sft_videorefer_qwen25vl.py  # Training entry, data loading, collation
│   │   ├── inference_qwen25vl.py  # Inference & attention visualization
│   │   ├── calc_attn_mask*.py  # Attention mask analysis tools
│   │   ├── data_process/
│   │   │   └── vision_process.py  # Video frame extraction & image processing
│   │   ├── trainer/  # Custom trainers (GRPO, OLA-GRPO, vLLM-GRPO)
│   │   └── utils/  # Evaluation helpers & callbacks
│   ├── data_config/videorefer.yaml  # Training dataset configuration
│   ├── run_scripts/
│   │   ├── run_sft_videorefer_qwen25vl.sh  # Training launch script
│   │   └── eval/  # Evaluation scripts
│   └── configs/  # DeepSpeed / DDP configs
├── transformers/src/transformers/models/
│   └── qwen2_5_vl/modeling_qwen2_5_vl.py  # Modified model with attention supervision
└── vis/  # Visualization utilities
Below is a map of the key code paths for anyone looking to understand or extend SWIM.
Q-R1/src/open-r1-multimodal/src/open_r1/sft_videorefer_qwen25vl.py
| What | Location | Description |
|---|---|---|
| Dataset class | LazySupervisedDataset (L99) | Loads multi-dataset from YAML config with flexible sampling strategies |
| Data format conversion | _maybe_apply_format_convert_videorefer | Converts VideoRefer JSON to conversation format, decodes RLE masks |
| Instance tag extraction | extract_ins_with_global_occurrence (L337) | Extracts <ins>...</ins> tagged entities and their occurrence counts |
| Batch collation | collate_fn (L380) | Tokenizes text, processes vision inputs, and constructs supervision labels |
| Attention labels | L496 – L512 | Creates attn_labels marking valid token positions for attention loss |
| Instance labels | L515 – L541 | Creates ins_contents_labels mapping entity tokens to instance indices |
| Trainer init | SFTTrainer(...) (L653) | Assembles model, dataset, and collate_fn into the trainer |
| Training loop | trainer.train() (L673) | Launches the training loop with optional checkpoint resume |
transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py
| What | Location | Description |
|---|---|---|
| Text loss | loss_function (L2041) | Standard cross-entropy on language modeling logits |
| Attention extraction | extract_and_fuse_attentions (L2078) | Extracts attention maps from layers [2, 7, 12, 17, 22, 27] and fuses across heads |
| Label filtering | build_label_and_index (L2177) | Filters out ignored tokens (-100) to get valid supervision indices |
| Pred-GT pair collection | collect_pred_gt_pairs_from_fused_attn (L2113) | Pairs predicted attention masks with ground-truth object masks |
| Mask loss | compute_bce_loss_from_pairs (L2212) | Binary cross-entropy between predicted attention and GT masks |
| Combined loss | L2373 | loss = text_loss * 0.05 + loss_mask |
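To make the combined objective concrete, here is a minimal pure-Python sketch of a binary cross-entropy mask loss and the weighting `text_loss * 0.05 + loss_mask` from the table. It assumes the predicted attention has already been mapped to per-patch probabilities in (0, 1) and flattened to align with the ground-truth mask. The function names and toy numbers are hypothetical, and the repository's `compute_bce_loss_from_pairs` may differ in normalisation and reduction.

```python
import math

TEXT_LOSS_WEIGHT = 0.05  # weight on the text loss, as listed in the table


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities `pred` and 0/1 `target`."""
    assert len(pred) == len(target) and pred, "inputs must be non-empty and aligned"
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def combined_loss(text_loss, mask_loss):
    """Combine the two terms as in the table: text_loss * 0.05 + loss_mask."""
    return text_loss * TEXT_LOSS_WEIGHT + mask_loss


# Toy example (invented numbers): four visual patches, the object covers the first two.
pred_attention = [0.9, 0.8, 0.2, 0.1]
gt_mask = [1, 1, 0, 0]

loss_mask = bce_loss(pred_attention, gt_mask)
loss = combined_loss(text_loss=2.0, mask_loss=loss_mask)
print(round(loss_mask, 4), round(loss, 4))
```

With the text term scaled by 0.05, the mask term carries most of the gradient signal whenever the two raw losses are of similar magnitude.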
Q-R1/src/open-r1-multimodal/src/open_r1/data_process/vision_process.py
| What | Location | Description |
|---|---|---|
| Video loading | fetch_video (L459) | Reads video via Decord, samples frames at target FPS, smart resize |
| Image loading | fetch_image (L106) | Handles local / HTTP / base64 / PIL sources, aspect-ratio-aware resize |
| Vision info dispatch | process_vision_info (L678) | Extracts mask info from conversations, routes to image or video processing |
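The frame-sampling step can be pictured with the sketch below, which picks evenly spaced frame indices for a target FPS. This is an illustration under assumptions: `sample_frame_indices` is a hypothetical helper, and the real `fetch_video` involves details (such as smart resize) that are not reproduced here.

```python
def sample_frame_indices(total_frames, native_fps, target_fps):
    """Return evenly spaced frame indices approximating `target_fps`.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, native_fps and target_fps must be positive")
    duration_s = total_frames / native_fps
    # Keep at least one frame and never more than the video contains.
    num_samples = max(1, min(total_frames, round(duration_s * target_fps)))
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)  # spread from first to last frame
    return [round(i * step) for i in range(num_samples)]


# A 10-second clip recorded at 30 FPS, sampled at 1 FPS.
indices = sample_frame_indices(total_frames=300, native_fps=30.0, target_fps=1.0)
print(len(indices), indices[0], indices[-1])
```

Lowering the target FPS reduces the number of visual tokens per video roughly in proportion, which is the main lever for fitting long clips into memory.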
transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py
The Qwen2.5-VL forward pass (L2235) is extended with four additional inputs for attention supervision:
| Parameter | Purpose |
|---|---|
| attn_labels | Marks which token positions participate in the attention loss |
| ins_contents_labels | Maps each entity token to its instance index |
| ins_masks | Ground-truth binary segmentation masks per instance |
| mask_index | Indicates which video frames carry mask annotations |
The attention supervision pipeline runs at L2339 – L2363: extract multi-layer attention → filter valid tokens → collect pred/GT pairs → compute BCE loss.
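The first three stages of this pipeline can be sketched on toy data as follows. Everything here is a simplified assumption: attention maps are plain nested lists with heads already merged, fusion is a plain average over the selected layers, and the helper names (`fuse_layers`, `valid_token_indices`, `collect_pairs`) are invented for illustration rather than taken from the modified `modeling_qwen2_5_vl.py`. The final BCE step is omitted.

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from supervision


def fuse_layers(attn_by_layer, layer_ids):
    """Average per-token attention over visual patches across the chosen layers.

    attn_by_layer: {layer_id: [[attention over patches] for each token]}
    """
    chosen = [attn_by_layer[i] for i in layer_ids]
    num_tokens, num_patches = len(chosen[0]), len(chosen[0][0])
    return [
        [sum(layer[t][p] for layer in chosen) / len(chosen) for p in range(num_patches)]
        for t in range(num_tokens)
    ]


def valid_token_indices(attn_labels):
    """Positions whose label is not the ignore index."""
    return [i for i, label in enumerate(attn_labels) if label != IGNORE_INDEX]


def collect_pairs(fused_attn, attn_labels, ins_labels, ins_masks):
    """Pair each supervised token's attention row with its instance's GT mask."""
    return [
        (fused_attn[i], ins_masks[ins_labels[i]])
        for i in valid_token_indices(attn_labels)
    ]


# Toy setup (invented): 3 tokens, 4 visual patches, 2 layers, 1 instance.
attn = {
    2: [[0.1, 0.1, 0.4, 0.4], [0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]],
    7: [[0.3, 0.3, 0.2, 0.2], [0.5, 0.3, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]],
}
attn_labels = [-100, 1, -100]  # only token 1 is an entity token
ins_labels = [-100, 0, -100]   # token 1 refers to instance 0
ins_masks = [[1, 1, 0, 0]]     # instance 0 covers the first two patches

fused = fuse_layers(attn, layer_ids=[2, 7])
pairs = collect_pairs(fused, attn_labels, ins_labels, ins_masks)
print(pairs)
```

Each resulting pair holds one attention row and one mask of the same length, which is the shape a per-patch binary cross-entropy would consume.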
If you find this work useful, please consider citing:
@inproceedings{sun2026swim,
  title = {See What I Mean: Aligning Vision and Language Representations
           for Video Fine-grained Object Understanding},
  author = {Sun, Boyuan and Yin, Bowen and Li, Yuanming and Wei, Xihan and Hou, Qibin},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026}
}
This code is licensed under CC BY-NC 4.0 for non-commercial use only. Commercial use requires prior written permission.
| Purpose | Contact |
|---|---|
| Technical questions | sbysbysby123[AT]gmail.com |
| Commercial licensing | andrewhoux[AT]gmail.com |
| Jobs / internships at Tongyi Lab | xihan.wxh@alibaba-inc.com (WeChat: weixihan1) |
We thank open-r1, PixelRefer, Qwen2.5-VL, transformers, lmms_eval, and LLaVA-Video-178K for their excellent work.
