TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection (AAAI 2024 Paper)

by Hao Sun*¹, Mingyao Zhou*¹, Wenjing Chen†², Wei Xie†¹

¹ Central China Normal University, ² Hubei University of Technology, * Equal contribution, † Corresponding authors.

Prerequisites

0. Clone this repo

1. Prepare datasets

QVHighlights: Download the official feature files for the QVHighlights dataset from Moment-DETR.

Download moment_detr_features.tar.gz (8GB) and extract it under the '../features' directory. You can change the data directory by modifying 'feat_root' in the shell scripts under the 'tr_detr/scripts/' directory.

tar -xf path/to/moment_detr_features.tar.gz
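As a minimal sketch of the extraction step, the helper below unpacks an archive into a chosen directory. The function name `extract_features` and the paths in the usage comment are illustrative assumptions and not part of this repository. The internal layout of the archive is not described here, so it may be worth listing its contents first and choosing the destination accordingly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): unpack a feature archive
# into a destination directory, creating the directory if needed.
extract_features() {
    local archive="$1"   # e.g. path/to/moment_detr_features.tar.gz
    local dest="$2"      # directory to extract into
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}

# Example usage (paths are assumptions; adjust to your setup):
#   tar -tf path/to/moment_detr_features.tar.gz | head    # inspect the archive layout
#   extract_features path/to/moment_detr_features.tar.gz ../features
#   grep -n "feat_root" tr_detr/scripts/*.sh              # find where the data directory is set
```

After extraction, the directory that 'feat_root' points to should be the one that actually contains the feature folders.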

TVSum: Download the feature files for the TVSum dataset from UMT.

Download TVSum (69.1MB), and either extract it under the '../features/tvsum/' directory or change 'feat_root' in the TVSum shell scripts under 'tr_detr/scripts/tvsum/'.

2. Install dependencies. Python version 3.7 is required.

pip install -r requirements.txt

requirements.txt currently also includes some other libraries; it will be cleaned up soon. For an Anaconda setup, please refer to the official Moment-DETR GitHub repository.

QVHighlights

Training

Training with video only or with video + audio can be run with the following shell scripts, respectively:

bash tr_detr/scripts/train.sh 
bash tr_detr/scripts/train_audio.sh

The best validation accuracy is obtained at the last epoch.

Inference Evaluation and Codalab Submission for QVHighlights

Once the model is trained, hl_val_submission.jsonl and hl_test_submission.jsonl can be generated by running inference.sh.

bash tr_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash tr_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'

where {direc} is the directory that contains the saved checkpoint. For more details on submission, see standalone_eval/README.md.
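The two inference commands above can be wrapped in a small loop. The sketch below is a convenience wrapper under stated assumptions: the function name `run_inference_all` and the `DRY_RUN` variable are hypothetical and not part of this repository, and the checkpoint path in the usage comment is only a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the repo): run inference.sh on the
# 'val' and 'test' splits for one checkpoint.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_inference_all() {
    local ckpt="$1"
    local split
    for split in val test; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bash tr_detr/scripts/inference.sh $ckpt $split"
        else
            # Stop at the first failing split.
            bash tr_detr/scripts/inference.sh "$ckpt" "$split" || return 1
        fi
    done
}

# Example usage from the repository root (replace {direc} with your run directory):
#   DRY_RUN=1 run_inference_all "results/{direc}/model_best.ckpt"
#   run_inference_all "results/{direc}/model_best.ckpt"
```

Running with DRY_RUN=1 first is a cheap way to confirm the checkpoint path before starting the actual inference.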

TVSum

Training with video only or with video + audio can be run with the following shell scripts, respectively:

bash tr_detr/scripts/tvsum/train_tvsum.sh 
bash tr_detr/scripts/tvsum/train_tvsum_audio.sh

Best results are stored in 'results_[domain_name]/best_metric.jsonl'.
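To look at the stored results for all domains at once, something like the following sketch can be used. It assumes the result directories sit under a common root and follow the 'results_[domain_name]' naming given above; the function name `show_best_metrics` is hypothetical, and no assumption is made about the fields inside each file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print every best_metric.jsonl
# found in results_* directories under a root (default: current directory).
show_best_metrics() {
    local root="${1:-.}"
    local f found=0
    for f in "$root"/results_*/best_metric.jsonl; do
        [ -f "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        found=1
        echo "== $f"
        cat "$f"
    done
    [ "$found" -eq 1 ]   # non-zero exit status if no result file was found
}

# Example usage from the directory that holds the results_* folders:
#   show_best_metrics .
```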

Cite TR-DETR (TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection)

If you find this repository useful, please use the following entry for citation.

@article{Sun_Zhou_Chen_Xie_2024, 
  title={TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection}, 
  volume={38}, 
  url={https://ojs.aaai.org/index.php/AAAI/article/view/28304}, 
  DOI={10.1609/aaai.v38i5.28304}, 
  abstractNote={Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain relevant moments within videos and highlight scores of each video clip. Recently, several methods have been devoted to building DETR-based networks to solve both MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, these approaches underutilize the reciprocal relationship between two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement is designed to eliminate query-irrelevant information from visual features for modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Codes are available at https://github.com/mingyao1120/TR-DETR.}, 
  number={5}, 
  journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
  author={Sun, Hao and Zhou, Mingyao and Chen, Wenjing and Xie, Wei}, 
  year={2024}, 
  month={Mar.}, 
  pages={4998-5007} 
}

LICENSE

The annotation files and many parts of the implementation are borrowed from Moment-DETR and QD-DETR. Following them, our code is also released under the MIT license.
