TrajTok-v2

Official open-source release for "TrajTok-v2: Trajectory-aware visual tokenization for VLMs" (Zheng et al., 2026).

📄 Paper: arXiv:2602.22779

🤗 Released checkpoints:

Segmenter — michaelzch001/trajtokv2-segmenter
TrajViT-v2 — michaelzch001/trajtokv2-trajvitv2
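
The checkpoints above are given as Hugging Face repository IDs. The following is a minimal sketch of fetching them with `huggingface_hub.snapshot_download`, assuming the IDs resolve to model repositories on the Hub. The `CHECKPOINTS` mapping and the `fetch_checkpoint` helper are illustrative names, not part of this repository; the sub-package READMEs remain the authority on how the weights are actually loaded.

```python
from __future__ import annotations

from pathlib import Path

# Repository IDs copied from the checkpoint list above.
CHECKPOINTS = {
    "segmenter": "michaelzch001/trajtokv2-segmenter",
    "trajvitv2": "michaelzch001/trajtokv2-trajvitv2",
}


def fetch_checkpoint(name: str, cache_dir: str | None = None) -> Path:
    """Download one released checkpoint and return its local directory.

    Hypothetical helper; assumes `pip install huggingface_hub`.
    """
    if name not in CHECKPOINTS:
        raise KeyError(f"unknown checkpoint {name!r}; choose from {sorted(CHECKPOINTS)}")
    # Imported lazily so the mapping is usable without the dependency installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=CHECKPOINTS[name], cache_dir=cache_dir))
```

Calling `fetch_checkpoint("segmenter")` would download the repository snapshot into the local Hub cache and return its path; what the files inside are named is not assumed here.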

TrajTok-v2 produces a small number of trajectory tokens per image or video clip, each bundling the pixels that belong to the same object instance across space and time. These tokens are then consumed by a vision transformer (TrajViT-v2) or a VLM (TrajVLM) at a fraction of the LLM-token cost of conventional patch tokenization, while preserving fine-grained, object-grounded information.
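
To make the cost comparison concrete, the sketch below counts the tokens produced by a plain non-overlapping patch grid and compares that with a fixed trajectory-token budget. Every number in it (224×224 frames, 16-pixel patches, 16 frames, 128 trajectory tokens) is an illustrative assumption, not a figure from the paper or this repository.

```python
def patch_token_count(height: int, width: int, patch: int, frames: int = 1) -> int:
    """Tokens from a non-overlapping patch grid, one grid per frame."""
    return (height // patch) * (width // patch) * frames


# Assumed clip: 16 frames of 224x224 with 16-pixel patches -> 14 * 14 * 16 tokens.
patch_tokens = patch_token_count(224, 224, patch=16, frames=16)

# Assumed trajectory-token budget; the real count depends on the segmenter.
trajectory_tokens = 128

print(patch_tokens, trajectory_tokens, patch_tokens / trajectory_tokens)
# prints: 3136 128 24.5
```

Under these assumptions the language model would see roughly 24 times fewer visual tokens; the actual ratio depends on the resolution, the clip length, and how many trajectories the segmenter emits.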

This repository hosts three self-contained sub-packages:

| Package | What it gives you |
| --- | --- |
| `segmenter/` | The trajectory segmenter: DINOv3-small ConvNeXt + PerceiverResampler + soft-mask grouping. Trained on ~12M images and videos (SA-1B, SA-V, internal mix). Includes the released checkpoint, evaluation code, and a qualitative demo. |
| `trajvitv2/` | TrajViT-v2: a SegmentTokenizer wrapping the segmenter, followed by a CLIP-style ViT-Large transformer over trajectory tokens. Includes training code, evaluation code, and a checkpoint (trained on a small-scale filtered subset of Panda-70M; not a final-scale model). |
| `trajvlm/` | TrajVLM, a vision-language model: SigLIP2 ViT features are pooled by our segmenter into trajectory tokens and fed into Qwen3-4B-Instruct. Includes training and evaluation code. |
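
The table describes trajectory tokens as dense features pooled under the segmenter's soft masks. Below is a minimal pure-Python sketch of mask-weighted average pooling, which is one plausible reading of that step. The function name, the shapes, and the normalization are assumptions made for illustration; the released code may pool differently and would operate on batched tensors rather than lists.

```python
def soft_mask_pool(features, masks, eps=1e-6):
    """Pool per-location features into one token per soft mask.

    features: N rows of D floats (one row per pixel or patch location).
    masks:    K rows of N non-negative weights (one row per trajectory).
    Returns K rows of D floats: the mask-weighted mean feature.
    """
    dim = len(features[0])
    tokens = []
    for mask in masks:
        if len(mask) != len(features):
            raise ValueError("each mask needs one weight per feature location")
        total = sum(mask) + eps  # eps guards against an all-zero mask
        token = [
            sum(w * feat[d] for w, feat in zip(mask, features)) / total
            for d in range(dim)
        ]
        tokens.append(token)
    return tokens


# Toy input: 4 locations with 2-dim features, grouped into 2 trajectories.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
masks = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
tokens = soft_mask_pool(features, masks)
```

Here the first token averages the first two locations and the second averages the last two, so four location features collapse into two tokens. The number of tokens follows the number of masks rather than the size of the patch grid.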

Each sub-package has its own README.md, dependencies, and quickstart. Start with the one matching what you want to do:

Want to use trajectory tokens for your own model? → segmenter/
Want to reproduce TrajViT-v2 retrieval experiments? → trajvitv2/
Want to train a VLM with trajectory tokens? → trajvlm/

Citation

@article{zheng2026trajtokv2,
  title   = {TrajTok-v2: Trajectory-aware visual tokenization for vision-language models},
  author  = {Zheng, Chenhao and others},
  journal = {arXiv preprint arXiv:2602.22779},
  year    = {2026},
}

License

Apache-2.0 — see LICENSE.

The DINOv3 ConvNeXt backbone used by the segmenter is bundled separately under its own Apache-2.0 license (Meta AI). SigLIP2 weights used by TrajVLM are licensed by Google under the Apache-2.0 license.
