Awesome-Referring-Video-Object-Segmentation / Tracking

Stars ⭐, comments 💹, and sharing 😀 are welcome!

- 2021.12.12: Recent papers (from 2021) 
- Contributions are welcome if any information is missing. 😎

Introduction

Referring video object segmentation (RVOS) aims to segment an object in a video given a language expression.

Unlike conventional video object segmentation, the task relies on a different type of supervision, language expressions, to identify and segment the object that the expression refers to in a video. A detailed explanation of the task can be found in the following paper.

Seonguk Seo, Joon-Young Lee, Bohyung Han, “URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark”, [ECCV 2020]:https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123600205.pdf

Impressive Works Related to Referring Video Object Segmentation (RVOS)

R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency [ICCV 2023]:https://arxiv.org/abs/2207.01203 [Repo]:https://github.com/lxa9867/R2VOS

Spectrum-guided Multi-granularity Referring Video Object Segmentation [ICCV 2023]:https://arxiv.org/pdf/2307.13537.pdf

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation [ICCV 2023]:https://arxiv.org/pdf/2307.09356.pdf

Decoupling Multimodal Transformers for Referring Video Object Segmentation [TCSVT 2023]
Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning [TCSVT 2023]
Referring Video Segmentation with (Optional) Arbitrary Modality as Query for Fusion [ArXiv]

VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [PAMI 2023]
Multi-Attention Network for Compressed Video Referring Object Segmentation [ACM MM 2022]:https://arxiv.org/pdf/2207.12622.pdf
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation [CVPR 2022]:https://arxiv.org/pdf/2206.03789.pdf

Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation [CVPR 2022]

Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [CVPR 2022]:https://arxiv.org/pdf/2204.02547.pdf

Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [ArXiv 2022]:https://arxiv.org/pdf/2203.15969.pdf
Local-Global Context Aware Transformer for Language-Guided Video Segmentation [ArXiv 2022]:https://github.com/leonnnop/Locater

ReferFormer [CVPR 2022]:https://arxiv.org/pdf/2201.00487.pdf
MTTR [CVPR 2022]:https://github.com/mttr2021/MTTR
YOFO: You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation [AAAI 2022]:https://www.aaai.org/AAAI22Papers/AAAI-1100.LiD.pdf
ClawCraneNet [ArXiv]:https://arxiv.org/pdf/2103.10702.pdf
PMINet [CVPRW 2021]:https://youtube-vos.org/assets/challenge/2021/reports/RVOS_2_Ding.pdf
RVOS challenge 1st model [CVPRW 2021]:https://arxiv.org/pdf/2106.01061.pdf

CMPC-V [PAMI 2021]:https://github.com/spyflying/CMPC-Refseg

Cross-modal progressive comprehension for referring segmentation:https://arxiv.org/abs/2105.07175

HINet [BMVC 2021]:https://www.bmvc2021-virtualconference.com/assets/papers/0386.pdf
URVOS [ECCV 2020]:https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123600205.pdf

Impressive Works Related to Referring Image Segmentation (Refer-image-segmentation)

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation:https://arxiv.org/pdf/2112.02244.pdf
SeqTR: A Simple yet Universal Network for Visual Grounding:https://arxiv.org/pdf/2203.16265.pdf

Impressive Works Related to Referring Multi-Object Tracking (RMOT)

Referring Multi-Object Tracking [CVPR 2023]:https://arxiv.org/abs/2303.03366

Benchmark

The 3rd Large-scale Video Object Segmentation - Track 3: Referring Video Object Segmentation

Datasets

Refer-YouTube-VOS-datasets

YouTube-VOS:

# Fetch the raw script; the /blob/ URL would save the GitHub HTML page instead.
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_YTVOS_w_refer.py
python down_YTVOS_w_refer.py

Folder structure:

${current_path}/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
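
After the download finishes, it can help to confirm that the tree above is in place before starting training. The following is a minimal sketch that relies only on the paths listed in the folder structure; the names `EXPECTED_DIRS`, `EXPECTED_FILES`, `missing_entries`, and `count_train_pairs` are hypothetical helpers, not part of the repository scripts, and the frame/mask pairing rule is an assumption.

```python
from pathlib import Path

# Relative paths copied from the folder structure above.
EXPECTED_DIRS = [
    "train/JPEGImages",
    "train/Annotations",
    "valid/JPEGImages",
]
EXPECTED_FILES = [
    "meta_expressions/train/meta_expressions.json",
    "meta_expressions/valid/meta_expressions.json",
]


def missing_entries(root):
    """Return the expected relative paths that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def count_train_pairs(root):
    """Count training frames and how many have a same-named .png mask.

    Assumption: a mask lives in the same video folder name and shares the
    file stem of its frame. A lower mask count is not necessarily an error.
    """
    root = Path(root)
    frames = 0
    masks = 0
    for jpg in (root / "train" / "JPEGImages").glob("*/*.jpg"):
        frames += 1
        png = root / "train" / "Annotations" / jpg.parent.name / (jpg.stem + ".png")
        if png.is_file():
            masks += 1
    return frames, masks
```

Calling `missing_entries("refer_youtube_vos")` from `${current_path}` should return an empty list when the layout matches the tree above.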

A2D-Sentences:

REPO:https://web.eecs.umich.edu/~jjcorso/r/a2d/

paper:https://arxiv.org/abs/1803.07485

Citation:

@misc{gavrilyuk2018actor,
      title={Actor and Action Video Segmentation from a Sentence}, 
      author={Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},
      year={2018},
      eprint={1803.07485},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License: The dataset may not be republished in any form without the written consent of the authors.

Downloads: README; Dataset and Annotation (version 1.0, 1.9GB, tar.bz); Evaluation Toolkit (version 1.0, tar.bz)

mkdir a2d_sentences
cd a2d_sentences
wget https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz
tar jxvf A2D_main_1_0.tar.bz
mkdir text_annotations

cd text_annotations
wget https://kgavrilyuk.github.io/actor_action/a2d_annotation.txt
wget https://kgavrilyuk.github.io/actor_action/a2d_missed_videos.txt
# Fetch the raw script; the /blob/ URL would save the GitHub HTML page instead.
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_a2d_annotation_with_instances.py
python down_a2d_annotation_with_instances.py
unzip a2d_annotation_with_instances.zip
#rm a2d_annotation_with_instances.zip
cd ..

cd ..

Folder structure:

${current_path}/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files)
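
To see how the pieces of this layout fit together, the sketch below builds a per-video index of clips and annotation files. It is only an illustration under one assumption: that each video folder under `a2d_annotation_with_instances/` carries the same name as the stem of its `.mp4` file in `CLIPS320/`. The function name `index_a2d` is hypothetical and is not provided by the dataset or this repository.

```python
from pathlib import Path


def index_a2d(root):
    """Map each annotated video id to its clip path and .h5 annotation files.

    Assumption: the annotation folder name equals the .mp4 file stem.
    """
    root = Path(root)
    clips = root / "Release" / "CLIPS320"
    ann_root = root / "text_annotations" / "a2d_annotation_with_instances"
    index = {}
    for video_dir in sorted(p for p in ann_root.iterdir() if p.is_dir()):
        clip = clips / (video_dir.name + ".mp4")
        index[video_dir.name] = {
            # None marks an annotated video whose clip file was not found.
            "clip": clip if clip.is_file() else None,
            "annotations": sorted(video_dir.glob("*.h5")),
        }
    return index
```

For example, `index_a2d("a2d_sentences")` returns a dictionary keyed by video folder name; entries whose `"clip"` is `None` point at a mismatch between the annotations and the extracted videos.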

Citation:

@inproceedings{YaXuCaCVPR2017,
  author = {Yan, Y. and Xu, C. and Cai, D. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking},
  year = {2017}
}
@inproceedings{XuCoCVPR2016,
  author = {Xu, C. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Actor-Action Semantic Segmentation with Grouping-Process Models},
  year = {2016}
}
@inproceedings{XuHsXiCVPR2015,
  author = {Xu, C. and Hsieh, S.-H. and Xiong, C. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
  poster = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D_poster.pdf},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Can Humans Fly? {Action} Understanding with Multiple Classes of Actors},
  url = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D.pdf},
  year = {2015}
}

J-HMDB:http://jhmdb.is.tue.mpg.de/

Download script:

mkdir jhmdb_sentences
cd jhmdb_sentences
wget http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz
wget https://kgavrilyuk.github.io/actor_action/jhmdb_annotation.txt
wget http://files.is.tue.mpg.de/jhmdb/puppet_mask.zip
tar -xzvf  Rename_Images.tar.gz
unzip puppet_mask.zip
cd ..

Folder structure:

${current_path}/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)
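
Since frames and masks come from two separate archives, a quick consistency check can catch an incomplete extraction. The sketch below compares the action directories under `Rename_Images/` and `puppet_mask/`; it uses only the folder names shown above, and `action_dirs` and `compare_actions` are hypothetical helper names introduced for illustration.

```python
from pathlib import Path


def action_dirs(folder):
    """Return the names of the immediate subdirectories of folder."""
    return {p.name for p in Path(folder).iterdir() if p.is_dir()}


def compare_actions(root):
    """Report which action dirs exist for frames, for masks, or for both."""
    root = Path(root)
    frames = action_dirs(root / "Rename_Images")
    masks = action_dirs(root / "puppet_mask")
    return {
        "both": sorted(frames & masks),
        "frames_only": sorted(frames - masks),
        "masks_only": sorted(masks - frames),
    }
```

If `compare_actions("jhmdb_sentences")` reports non-empty `frames_only` or `masks_only` lists, one of the two archives was probably not fully extracted.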

Citation:

@inproceedings{Jhuang:ICCV:2013,
title = {Towards understanding action recognition},
author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
booktitle = {International Conf. on Computer Vision (ICCV)},
month = Dec,
pages = {3192-3199},
year = {2013}
}

Refer-DAVIS16/17:https://arxiv.org/pdf/1803.08006.pdf
