
HumanMLLM/SWIM


SWIM: See What I Mean

Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Boyuan Sun 1,2, Bowen Yin 1,2, Yuanming Li 2, Xihan Wei 2, Qibin Hou 1,†

1 VCIP, Nankai University   2 Tongyi Lab, Alibaba Group   † Corresponding author



Overview

SWIM enables multimodal large language models to understand specific objects in videos at a fine-grained level. Given a video and a natural language reference to a target object, SWIM can accurately describe the object's appearance, actions, and temporal dynamics — while avoiding hallucination about irrelevant objects.

Core idea — Apply attention-level supervision during training so the model learns to attend to the correct visual regions when generating descriptions of a referred object.

[Figure: SWIM pipeline]

Highlights

1. NL-Refer Dataset: A natural-language referring dataset built on top of VideoRefer-700K. Unlike the original, which uses visual prompts (colored masks), NL-Refer replaces them with natural language descriptions, enabling a more practical and scalable referring paradigm.
2. Attention Supervision: During SFT, the model receives an additional loss on attention maps to encourage correct grounding between <ins>...</ins>-tagged entity tokens and the corresponding visual regions.
3. Selective Fine-tuning: The vision encoder is frozen; only the language model is updated, keeping training efficient.

News

  • 2026-05 — Code of SWIM is released.
  • 2026-05 — Paper of SWIM is released.

Getting Started

1. Installation

git clone git@github.com:HumanMLLM/SWIM.git
cd SWIM

conda create -n swim python=3.10
conda activate swim

# Core dependencies
cd Q-R1
pip install -e .
pip install trl

# Modified transformers (required for attention supervision)
cd ../transformers
pip install -e .

pip install matplotlib huggingface_hub

Note — SWIM depends on a custom fork of HuggingFace Transformers shipped in transformers/. You must install it from this repo, not from PyPI.

2. Download Model

mkdir model_zoo && cd model_zoo

# (Optional) HuggingFace mirror for faster download in China
# export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download --resume-download BBBBCHAN/SWIM-7B --local-dir SWIM-7B

3. Download Data — NL-Refer (For Training)

We introduce NL-Refer, a natural-language referring dataset built on top of VideoRefer-700K. Unlike the original dataset which uses visual prompts (colored masks overlaid on frames) to indicate target objects, NL-Refer replaces them with natural language referring expressions — enabling a more practical paradigm where users simply describe the object in words.

The dataset is constructed by prompting GPT-4o to rewrite each <objectx><region> placeholder as a concise natural-language description. The core referring word in each description is tagged with <ins>...</ins> so that it can be used for attention supervision.
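
As a rough illustration of the annotation format described above, the sketch below checks that a rewritten string no longer contains region placeholders and that its <ins> tags are well formed. The numeric placeholder form `<object0><region>`, the helper name `check_rewritten_text`, and the sample sentence are assumptions made for illustration; none of them come from the NL-Refer construction scripts.

```python
import re

# Assumed placeholder form: <object0><region>, <object1><region>, ...
PLACEHOLDER = re.compile(r"<object\d+>\s*<region>")
INS_SPAN = re.compile(r"<ins>(.+?)</ins>")


def check_rewritten_text(text: str) -> list[str]:
    """Return a list of problems found in a rewritten annotation string."""
    problems = []
    if PLACEHOLDER.search(text):
        problems.append("leftover region placeholder")
    if text.count("<ins>") != text.count("</ins>"):
        problems.append("unbalanced <ins> tags")
    if not INS_SPAN.search(text):
        problems.append("no <ins>...</ins> referring word")
    return problems


# Hypothetical sample, not taken from the dataset.
sample = "What is the <ins>dog</ins> on the left doing?"
print(check_rewritten_text(sample))  # []
```

A check like this could be run over the rewritten annotations before training to catch samples where the rewrite step left a placeholder behind.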

Dataset and construction scripts: BBBBCHAN/NL-Refer

Dataset Structure
NL-Refer/
├── train/                                        # Training annotations
│   ├── refined-format-videorefer-detailed-caption-*.json   # NL-Refer-D (~125K, 4 shards)
│   ├── refined-format-videorefer-qa-0-10k.json             # NL-Refer-Q (~10K)
│   └── filtered_valid_llava_video_178k_*.json              # LLaVA-Video supplementary
├── bench/                                        # Evaluation benchmarks
│   ├── refined-VideoRefer-Bench-D.json           # Description generation (400 samples)
│   ├── refined-VideoRefer-Bench-Q.json           # Multiple-choice QA (1000 samples)
│   └── *-synonym.json                            # Synonym-augmented variants
└── scripts/                                      # Dataset construction pipeline
    ├── construction/                             # GPT-4o rewriting scripts
    └── llava_video/                              # LLaVA-Video processing

Training annotation JSONs are hosted on BBBBCHAN/SWIM_data due to their size (~7GB total).

# Download benchmarks and construction scripts
huggingface-cli download --resume-download BBBBCHAN/NL-Refer --repo-type dataset --local-dir NL-Refer

Usage

Inference

SWIM-7B is fine-tuned from Qwen2.5-VL-7B-Instruct and shares the same inference API.

Quick Start
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model (flash_attention_2 recommended for speed and memory)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "BBBBCHAN/SWIM-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B")

# Optionally control visual token budget:
# processor = AutoProcessor.from_pretrained(
#     "BBBBCHAN/SWIM-7B", min_pixels=256*28*28, max_pixels=1280*28*28
# )

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
).to("cuda")

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
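
Because SWIM targets objects referred to in natural language, the text entry of the message can name the object of interest. The helper below is a small sketch that builds the same message structure as the Quick Start; the function name `build_video_message` and the example prompt are hypothetical. Whether inference prompts should carry <ins> tags is not stated here, so the sketch leaves them out.

```python
def build_video_message(video_path: str, prompt: str,
                        fps: float = 1.0,
                        max_pixels: int = 360 * 420) -> list[dict]:
    """Build a single-turn user message in the Quick Start format."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": f"file://{video_path}",  # absolute local path
                    "max_pixels": max_pixels,
                    "fps": fps,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_video_message(
    "/path/to/video.mp4",
    "Describe the person in the red jacket.",  # hypothetical referring prompt
)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the Quick Start.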

Training

Training uses DeepSpeed ZeRO-3 on 8 GPUs with BF16 mixed precision and Flash Attention 2.

cd Q-R1/src/open-r1-multimodal

# Edit run_scripts/run_sft_videorefer_qwen25vl.sh to set:
#   --model_name_or_path  (path to Qwen2.5-VL-7B-Instruct)
#   --image_root          (path to your image/video data root)

bash run_scripts/run_sft_videorefer_qwen25vl.sh

Key Training Parameters

| Parameter | Value |
| --- | --- |
| Base model | Qwen2.5-VL-7B-Instruct |
| Batch size | 1 per GPU × 4 gradient accumulation steps |
| Learning rate | 2e-5 (cosine schedule) |
| Epochs | 1 |
| Precision | BF16 |
| Distribution | DeepSpeed ZeRO-3 offload |

Training data is configured in Q-R1/src/open-r1-multimodal/data_config/videorefer.yaml, combining ~125K NL-Refer samples with LLaVA-Video data.
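
The settings above imply an effective batch size that is worth making explicit. The arithmetic below assumes the 8-GPU setup mentioned at the start of this section. The step estimate uses only the ~125K NL-Refer samples, so it is a lower bound that ignores the LLaVA-Video data, whose size is not stated here.

```python
import math

per_gpu_batch = 1      # from the parameter table
grad_accum_steps = 4   # from the parameter table
num_gpus = 8           # from the training description

effective_batch = per_gpu_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 32


def optimizer_steps(num_samples: int, batch: int) -> int:
    """Optimizer steps for one epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch)


# Approximate: NL-Refer portion only (~125K samples).
print(optimizer_steps(125_000, effective_batch))  # 3907
```

Changing the GPU count without adjusting gradient accumulation changes the effective batch size, which may call for a different learning rate.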

Evaluation

VideoRefer-Bench

cd Q-R1/src/open-r1-multimodal/run_scripts/eval/videorefer

# VideoRefer-Bench-Q (multiple-choice QA)
bash eval_videorefer-bench-q_qwen2_5vl.sh

# VideoRefer-Bench-D (description generation, requires GPT-4o API key)
bash eval_videorefer-bench-d_qwen2_5vl.sh

For benchmark data and format details, see the VideoRefer-Bench README.

General Benchmarks

General video understanding benchmarks (MVBench, VideoMME, ActivityNetQA, etc.) are evaluated via lmms_eval:

MODEL_NAME="SWIM-7B"
MODEL_PATH="BBBBCHAN/SWIM-7B"

accelerate launch --num_processes 8 --main_process_port 23553 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=$MODEL_PATH,use_flash_attention_2=true \
    --tasks mvbench \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix eval \
    --output_path ./logs/$MODEL_NAME

Replace --tasks mvbench with videomme, activitynetqa, etc. for other benchmarks.
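
To run several benchmarks in sequence, the command above can be generated per task. The sketch below only builds and prints the command lines, using the same flags as the example; the helper name `build_eval_command` is hypothetical.

```python
import shlex


def build_eval_command(task: str,
                       model_path: str = "BBBBCHAN/SWIM-7B",
                       model_name: str = "SWIM-7B",
                       num_processes: int = 8,
                       port: int = 23553) -> list[str]:
    """Return the lmms_eval launch command for one task as an argv list."""
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),
        "--main_process_port", str(port),
        "-m", "lmms_eval",
        "--model", "qwen2_5_vl",
        "--model_args", f"pretrained={model_path},use_flash_attention_2=true",
        "--tasks", task,
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", "eval",
        "--output_path", f"./logs/{model_name}",
    ]


for task in ["mvbench", "videomme", "activitynetqa"]:
    print(shlex.join(build_eval_command(task)))
```

Each argv list can be handed to `subprocess.run(cmd, check=True)` to execute the evaluation instead of printing it.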


Project Structure

SWIM/
├── Q-R1/src/open-r1-multimodal/
│   ├── src/open_r1/
│   │   ├── sft_videorefer_qwen25vl.py    # Training entry, data loading, collation
│   │   ├── inference_qwen25vl.py          # Inference & attention visualization
│   │   ├── calc_attn_mask*.py             # Attention mask analysis tools
│   │   ├── data_process/
│   │   │   └── vision_process.py          # Video frame extraction & image processing
│   │   ├── trainer/                       # Custom trainers (GRPO, OLA-GRPO, vLLM-GRPO)
│   │   └── utils/                         # Evaluation helpers & callbacks
│   ├── data_config/videorefer.yaml        # Training dataset configuration
│   ├── run_scripts/
│   │   ├── run_sft_videorefer_qwen25vl.sh # Training launch script
│   │   └── eval/                          # Evaluation scripts
│   └── configs/                           # DeepSpeed / DDP configs
├── transformers/src/transformers/models/
│   └── qwen2_5_vl/modeling_qwen2_5_vl.py # Modified model with attention supervision
└── vis/                                   # Visualization utilities

Core Code Guide

Below is a map of the key code paths for anyone looking to understand or extend SWIM.

Training Pipeline

Q-R1/src/open-r1-multimodal/src/open_r1/sft_videorefer_qwen25vl.py

| What | Location | Description |
| --- | --- | --- |
| Dataset class | LazySupervisedDataset (L99) | Loads multi-dataset from YAML config with flexible sampling strategies |
| Data format conversion | _maybe_apply_format_convert_videorefer | Converts VideoRefer JSON to conversation format, decodes RLE masks |
| Instance tag extraction | extract_ins_with_global_occurrence (L337) | Extracts <ins>...</ins> tagged entities and their occurrence counts |
| Batch collation | collate_fn (L380) | Tokenizes text, processes vision inputs, and constructs supervision labels |
| └ Attention labels | L496 – L512 | Creates attn_labels marking valid token positions for attention loss |
| └ Instance labels | L515 – L541 | Creates ins_contents_labels mapping entity tokens to instance indices |
| Trainer init | SFTTrainer(...) (L653) | Assembles model, dataset, and collate_fn into the trainer |
| Training loop | trainer.train() (L673) | Launches the training loop with optional checkpoint resume |
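
To make the instance tag extraction step more concrete, here is a minimal sketch of pulling <ins>...</ins> entities out of a caption and tracking how often each has appeared. It is not the repository's extract_ins_with_global_occurrence; the function name, the 0-based occurrence index, and the sample caption are assumptions.

```python
import re
from collections import defaultdict

INS_PATTERN = re.compile(r"<ins>(.*?)</ins>")


def extract_ins_with_counts(text: str) -> list[tuple[str, int]]:
    """Return (entity, occurrence_index) for each <ins> span, in order.

    occurrence_index is how many times the same entity string has
    already appeared earlier in the text (0-based).
    """
    seen = defaultdict(int)
    result = []
    for match in INS_PATTERN.finditer(text):
        entity = match.group(1).strip()
        result.append((entity, seen[entity]))
        seen[entity] += 1
    return result


# Hypothetical caption, not taken from the dataset.
caption = "The <ins>dog</ins> chases a <ins>ball</ins>; then the <ins>dog</ins> rests."
print(extract_ins_with_counts(caption))
# [('dog', 0), ('ball', 0), ('dog', 1)]
```

Tracking the occurrence index is one way to tell repeated mentions of the same word apart when mapping entity tokens to instance indices.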

Loss Computation

transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py

| What | Location | Description |
| --- | --- | --- |
| Text loss | loss_function (L2041) | Standard cross-entropy on language modeling logits |
| Attention extraction | extract_and_fuse_attentions (L2078) | Extracts attention maps from layers [2, 7, 12, 17, 22, 27] and fuses across heads |
| Label filtering | build_label_and_index (L2177) | Filters out ignored tokens (-100) to get valid supervision indices |
| Pred-GT pair collection | collect_pred_gt_pairs_from_fused_attn (L2113) | Pairs predicted attention masks with ground-truth object masks |
| Mask loss | compute_bce_loss_from_pairs (L2212) | Binary cross-entropy between predicted attention and GT masks |
| Combined loss | L2373 | loss = text_loss * 0.05 + loss_mask |
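
The combined loss can be illustrated with a small, dependency-free sketch. It assumes the predicted attention has already been mapped to per-pixel probabilities in [0, 1]; how the repository normalizes attention before the BCE step is not described here. The function names and toy values are hypothetical, and only the 0.05 weight comes from the combined-loss formula above.

```python
import math

TEXT_LOSS_WEIGHT = 0.05  # weight on the text loss in the combined loss


def bce_loss(pred: list[float], target: list[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def combined_loss(text_loss: float, loss_mask: float) -> float:
    """loss = text_loss * 0.05 + loss_mask"""
    return text_loss * TEXT_LOSS_WEIGHT + loss_mask


# Toy values: a flattened 2x2 attention map and its ground-truth mask.
pred_attention = [0.9, 0.2, 0.7, 0.1]
gt_mask = [1, 0, 1, 0]

loss_mask = bce_loss(pred_attention, gt_mask)
print(combined_loss(text_loss=2.0, loss_mask=loss_mask))
```

With the text loss scaled by 0.05, the mask term has a comparatively large influence on the gradient, in line with the emphasis on attention supervision.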

Video / Image Processing

Q-R1/src/open-r1-multimodal/src/open_r1/data_process/vision_process.py

| What | Location | Description |
| --- | --- | --- |
| Video loading | fetch_video (L459) | Reads video via Decord, samples frames at target FPS, smart resize |
| Image loading | fetch_image (L106) | Handles local / HTTP / base64 / PIL sources, aspect-ratio-aware resize |
| Vision info dispatch | process_vision_info (L678) | Extracts mask info from conversations, routes to image or video processing |
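
Sampling frames at a target FPS comes down to choosing evenly spaced frame indices. The sketch below is a simplified illustration of that idea, not the logic in fetch_video: it ignores any minimum or maximum frame limits and any rounding of the frame count that the real implementation may apply. The function name is hypothetical.

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         target_fps: float) -> list[int]:
    """Pick evenly spaced frame indices at roughly target_fps."""
    duration_s = total_frames / video_fps
    num_samples = max(1, min(total_frames, round(duration_s * target_fps)))
    if num_samples == 1:
        return [0]
    # Spread the samples from the first frame to the last frame.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# A 10-second clip at 30 FPS, sampled at 1 FPS, yields 10 indices.
indices = sample_frame_indices(total_frames=300, video_fps=30.0, target_fps=1.0)
print(len(indices), indices[0], indices[-1])
```

The resulting indices could then be passed to a video reader to fetch only the selected frames.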

Model Modifications

transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py

The Qwen2.5-VL forward pass (L2235) is extended with four additional inputs for attention supervision:

| Parameter | Purpose |
| --- | --- |
| attn_labels | Marks which token positions participate in the attention loss |
| ins_contents_labels | Maps each entity token to its instance index |
| ins_masks | Ground-truth binary segmentation masks per instance |
| mask_index | Indicates which video frames carry mask annotations |

The attention supervision pipeline runs at L2339 – L2363: extract multi-layer attention → filter valid tokens → collect pred/GT pairs → compute BCE loss.
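
The first two stages of this pipeline can be sketched in plain Python. The layer indices and the -100 ignore value come from the Loss Computation table; the use of a simple mean for fusion, the nested-list layout, and the function names are assumptions, and the repository's extract_and_fuse_attentions may work differently. Pairing with ground-truth masks and the BCE step are omitted.

```python
SUPERVISED_LAYERS = [2, 7, 12, 17, 22, 27]  # layers listed for attention extraction
IGNORE_INDEX = -100


def fuse_attention(per_layer_attn, layers=SUPERVISED_LAYERS):
    """Average attention over the selected layers and all heads.

    per_layer_attn[layer][head][token] is a list of weights over visual
    positions. Returns fused[token][position]. Mean fusion is assumed.
    """
    selected = [per_layer_attn[i] for i in layers]
    num_heads = len(selected[0])
    num_tokens = len(selected[0][0])
    num_pos = len(selected[0][0][0])
    norm = len(selected) * num_heads
    fused = [[0.0] * num_pos for _ in range(num_tokens)]
    for layer in selected:
        for head in layer:
            for t, row in enumerate(head):
                for p, weight in enumerate(row):
                    fused[t][p] += weight / norm
    return fused


def valid_token_indices(labels):
    """Positions whose label is not the ignore value."""
    return [i for i, lab in enumerate(labels) if lab != IGNORE_INDEX]


# Toy input: 28 layers, 2 heads, 3 tokens, 2 visual positions.
toy_attn = [[[[layer / 100] * 2 for _ in range(3)] for _ in range(2)]
            for layer in range(28)]
fused = fuse_attention(toy_attn)
keep = valid_token_indices([-100, 0, 1])
print([fused[i] for i in keep])
```

Only the rows of the fused map that belong to valid tokens would go on to be compared against the ground-truth masks.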


Citation

If you find this work useful, please consider citing:

@inproceedings{sun2026swim,
  title     = {See What I Mean: Aligning Vision and Language Representations
               for Video Fine-grained Object Understanding},
  author    = {Sun, Boyuan and Yin, Bowen and Li, Yuanming and Wei, Xihan and Hou, Qibin},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

License

This code is licensed under CC BY-NC 4.0 for non-commercial use only. Commercial use requires prior written permission.

Contact

Technical questions: sbysbysby123[AT]gmail.com
Commercial licensing: andrewhoux[AT]gmail.com
Jobs / internships at Tongyi Lab: xihan.wxh@alibaba-inc.com (WeChat: weixihan1)

Acknowledgement

We thank open-r1, PixelRefer, Qwen2.5-VL, transformers, lmms_eval, and LLaVA-Video-178K for their excellent work.

About

Official Code for See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding (CVPR 2026)
