Official Implementation · AAAI 2026
Haoyu Tang1 · Tianyuan Liang1 · Han Jiang2 · Xuesong Liu3 · Qinghai Zheng4 · Yupeng Hu1 ✉
1 School of Software, Shandong University | 2 Xi'an Jiaotong University | 3 University of Glasgow | 4 Fuzhou University
✉ Corresponding author
Figure: Overview of the CASCADE framework — Context-Guided Action Filtering → Stage-Aware Decomposition → Stage-wise Confidence Estimation → Compositional Action Reconstruction.
- Updates
- Introduction
- Highlights
- Framework Overview
- Project Structure
- Installation
- Dataset
- Usage
- Main Results
- Visualization
- TODO
- Citation
- Acknowledgements
- License
- [04/2026] Initial release of code and paper.
Temporal Action Localization (TAL) aims to identify the category and precise temporal boundaries of actions in untrimmed videos. Current Zero-Shot Temporal Action Localization (ZSTAL) methods—whether training-based or training-free—predominantly rely on a single, unified query to represent an entire action. This strategy is fundamentally ill-suited for complex, real-world activities, as it fails to capture their internal compositional structure and dynamic multi-stage variations across videos.
To address this, we reframe ZSTAL as a compositional reasoning task and propose CASCADE (Context-Aware Staged Action DEcomposition), a novel training-free framework. Inspired by the human cognitive process—perceiving global context, decomposing events into stages, and reconstructing action instances—CASCADE:
- Leverages an MLLM to filter irrelevant actions and generate rich, video-specific captions.
- Uses an LLM to decompose each caption into temporally ordered key and non-key stages.
- Employs the MLLM to perform stage-wise frame-level confidence estimation.
- Applies a novel hierarchical merging logic to reconstruct complete action instances from stage segments.
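The four steps above can be sketched as a single pipeline. This is an illustrative sketch only: the callables passed in (`filter_fn`, `caption_fn`, `decompose_fn`, `score_fn`, `merge_fn`) stand in for the MLLM/LLM calls, and all names are hypothetical, not the repository's actual API.

```python
def cascade(frames, vocabulary, filter_fn, caption_fn, decompose_fn,
            score_fn, merge_fn):
    """Illustrative sketch of the CASCADE pipeline (training-free)."""
    # 1. Context-Guided Action Filtering: keep only actions the MLLM
    #    believes occur in this video.
    candidates = [a for a in vocabulary if filter_fn(frames, a)]

    instances = []
    for action in candidates:
        # 2. Stage-Aware Decomposition: video-specific caption -> ordered stages.
        caption = caption_fn(frames, action)
        stages = decompose_fn(caption)  # temporally ordered key/non-key stages

        # 3. Stage-wise Confidence Estimation: one score per (frame, stage).
        scores = [[score_fn(f, s) for f in frames] for s in stages]

        # 4. Compositional Action Reconstruction: fuse stage segments
        #    into complete action instances.
        instances.extend(merge_fn(action, stages, scores))
    return instances
```

Each module is independent of the others' internals, which is what makes the framework plug-and-play with different MLLM backbones.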
Extensive experiments on THUMOS14 and ActivityNet-1.3 show that CASCADE sets a new state-of-the-art among training-free methods and, most notably, significantly surpasses all prior training-based ZSTAL approaches on ActivityNet-1.3.
- 🏆 Training-Free SOTA: CASCADE outperforms all prior training-free ZSTAL methods on both THUMOS14 and ActivityNet-1.3.
- 🚀 Surpasses Training-Based Methods: On ActivityNet-1.3, our training-free approach outperforms all training-based ZSTAL competitors (e.g., +7.0% mAP over DeTAL under the 75/25 split).
- 🧠 Compositional Reasoning: The first ZSTAL framework to explicitly model the internal stage structure of complex actions via LLM-driven decomposition.
- 🔌 Plug-and-Play: Operates solely with off-the-shelf MLLMs (e.g., Qwen2.5-VL, LLaVA-1.5) and LLMs (e.g., DeepSeek-V3)—no task-specific fine-tuning required.
CASCADE consists of four sequential modules:
| Step | Module | Description |
|---|---|---|
| 1 | Context-Guided Action Filtering | MLLM identifies which actions from the predefined set actually occur in the video. |
| 2 | Stage-Aware Decomposition | MLLM generates a video-specific caption; LLM decomposes it into ordered key/non-key stages. |
| 3 | Stage-wise Confidence Estimation | MLLM computes frame-level confidence scores for each stage in a single batched forward pass. |
| 4 | Compositional Action Reconstruction | A hierarchical merging logic fuses stage segments into complete, coherent action instances. |
Figure: Illustration of stage-aware localization. Decomposing "Baking cookies" into sub-stages (cutting chocolate → mixing ingredients → baking) yields significantly more precise localization than a single-query approach.
AAAI26-CASCADE/
├── annotation/ # Annotation files for datasets
│ ├── activity_net.v1-3.min.json # ActivityNet-1.3 annotations
│ └── thumos_anno_action.json # THUMOS14 annotations
├── code/ # Core implementation of CASCADE
│ ├── 1category.py # Step 1: Context-Guided Action Filtering
│ ├── 2caption.py # Step 2: Video-specific caption generation
│ ├── 3stage.py # Step 3: Stage-Aware Decomposition (LLM)
│ ├── 4localization.py # Step 4: Stage-wise Confidence Estimation
│ ├── 5merge.py # Step 5: Compositional Action Reconstruction
│ └── 6value.py # Step 6: Evaluation & metrics
├── paper/
│ ├── 29083.pdf # Paper PDF
│ ├── framework.png # Framework overview figure
│ ├── scores.png # Performance score visualization
│ └── video-stage.png # Video stage decomposition illustration
├── LICENSE
└── README.md
- Python >= 3.8
- PyTorch >= 2.0
- CUDA-compatible GPU (experiments run on NVIDIA A100 80GB)
# Clone the repository
git clone https://github.com/iLearn-Lab/AAAI26-CASCADE.git
cd AAAI26-CASCADE
# Create a virtual environment (recommended)
conda create -n cascade python=3.10 -y
conda activate cascade
# Install dependencies
pip install -r requirements.txt
CASCADE is training-free and relies on the following off-the-shelf pretrained models:
| Role | Model | Source |
|---|---|---|
| Backbone MLLM (option 1) | Qwen2.5-VL-7B | HuggingFace |
| Backbone MLLM (option 2) | LLaVA-1.5-7B | HuggingFace |
| Stage Decomposition LLM | DeepSeek-V3 | DeepSeek API |
CASCADE is evaluated on two standard ZSTAL benchmarks:
THUMOS14
- 20 sports action classes; 200 validation videos (conventionally used for training) and 213 test videos.
- Evaluation at tIoU thresholds: {0.3, 0.4, 0.5, 0.6, 0.7}.
ActivityNet-1.3
- 200 action classes, ~20K videos across train/val/test splits.
- Evaluation at tIoU thresholds: {0.5, 0.75, 0.95}.
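A prediction counts as correct at a given threshold when its temporal IoU with a ground-truth segment meets that threshold. A minimal tIoU implementation (segments as `(start, end)` in seconds):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction covering the second half of a 10-second ground-truth action scores tIoU 0.5: a hit at threshold 0.5 but a miss at 0.75.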
Both datasets are evaluated under 75%/25% and 50%/50% seen/unseen class splits, averaged over 10 random splits for statistical robustness.
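The split protocol can be sketched as below. This is a simplified illustration under the stated assumptions (uniform random class shuffling with a fixed seed); the exact per-benchmark protocol follows prior ZSTAL work.

```python
import random

def make_splits(classes, seen_ratio=0.75, n_splits=10, seed=0):
    """Generate random seen/unseen class splits (e.g. 75%/25%) for averaging."""
    rng = random.Random(seed)
    n_seen = round(len(classes) * seen_ratio)
    splits = []
    for _ in range(n_splits):
        shuffled = list(classes)
        rng.shuffle(shuffled)
        splits.append((sorted(shuffled[:n_seen]), sorted(shuffled[n_seen:])))
    return splits
```

Reported numbers are the mean over the resulting `n_splits` seen/unseen partitions, which reduces the variance introduced by any single class split.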
Please refer to the official dataset pages for download instructions.
# Step 1 — Context-Guided Action Filtering
python code/1category.py --dataset activitynet --split 75_25
# Step 2 — Video-specific Caption Generation
python code/2caption.py --dataset activitynet --split 75_25
# Step 3 — Stage-Aware Decomposition (LLM)
python code/3stage.py --dataset activitynet --split 75_25
# Step 4 — Stage-wise Confidence Estimation
python code/4localization.py --dataset activitynet --backbone qwen --split 75_25
# Step 5 — Compositional Action Reconstruction
python code/5merge.py --dataset activitynet --split 75_25
# Step 6 — Evaluation
python code/6value.py --dataset activitynet --split 75_25
⚠️ Note: Detailed scripts and configuration files will be released shortly. Please check back or watch the repository for updates.
| Method | Training | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | mAP |
|---|---|---|---|---|---|---|---|
| Eff-Prompt | ✓ | 39.7 | 31.6 | 23.0 | 14.9 | 7.5 | 23.3 |
| STALE | ✓ | 40.5 | 32.3 | 23.5 | 15.3 | 7.6 | 23.8 |
| DeTAL | ✓ | 39.8 | 33.6 | 25.9 | 17.4 | 9.9 | 25.3 |
| T3AL | ✗ | 19.2 | 12.7 | 7.4 | 4.4 | 2.2 | 9.2 |
| FreeZAD | ✗ | 21.2 | 13.6 | 8.3 | 4.7 | 2.5 | 10.0 |
| ZEAL | ✗ | 22.1 | 16.1 | 11.0 | 5.7 | 3.0 | 11.6 |
| CASCADE-Qwen (Ours) | ✗ | 23.9 | 17.5 | 11.7 | 7.6 | 4.3 | 13.0 |
| CASCADE-LLaVA (Ours) | ✗ | 23.8 | 17.9 | 14.0 | 7.6 | 5.1 | 13.7 |
Results under the 75%/25% split. Training-free methods in bold.
| Method | Training | 0.5 | 0.75 | 0.95 | mAP |
|---|---|---|---|---|---|
| DeTAL | ✓ | 39.3 | 26.4 | 5.0 | 25.8 |
| FreeZAD | ✗ | 33.5 | 17.5 | 3.9 | 18.3 |
| CASCADE-Qwen (Ours) | ✗ | 41.4 | 27.4 | 7.1 | 25.3 |
| CASCADE-LLaVA (Ours) | ✗ | 52.7 | 36.7 | 8.3 | 32.6 |
Results under the 75%/25% split. CASCADE-LLaVA surpasses ALL training-based competitors.
The figure below shows CASCADE's localization process for the action "Baking cookies" on ActivityNet-1.3. The action is decomposed into four semantic stages (S1: Preparing Ingredients, S2: Mixing Ingredients, S3: Melting Chocolate, S4: Baking). Frame-level confidence scores are computed per stage, thresholded to yield raw proposals (P), and then fused via hierarchical merging into a final prediction (P̂) that closely matches the ground truth (GT).
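The threshold-then-merge step described above can be sketched as follows. This is a deliberately simplified stand-in for the paper's hierarchical merging logic (function names, the fixed threshold, and the `gap` heuristic are illustrative assumptions):

```python
def segments_from_scores(scores, threshold=0.5):
    """Frame-level confidences -> contiguous above-threshold segments (P)."""
    segs, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                       # segment opens
        elif s < threshold and start is not None:
            segs.append((start, i - 1))     # segment closes
            start = None
    if start is not None:
        segs.append((start, len(scores) - 1))
    return segs

def merge_close(segs, gap=1):
    """Fuse segments separated by at most `gap` frames -- a simplified
    stand-in for CASCADE's hierarchical merging into final predictions."""
    merged = []
    for s, e in sorted(segs):
        if merged and s - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

In CASCADE the merging is stage-aware (key stages anchor the instance, non-key stages bridge gaps), which this frame-gap heuristic only approximates.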
- Release full inference code
- Release detailed configuration files and prompts
- Release annotation files
- Add demo / visualization script
- Release pre-computed results
If you find CASCADE useful in your research, please consider citing our paper:
@inproceedings{tang2026cascade,
title = {Decompose and Conquer: Compositional Reasoning for Zero-Shot Temporal Action Localization},
author = {Tang, Haoyu and Liang, Tianyuan and Jiang, Han and Liu, Xuesong and Zheng, Qinghai and Hu, Yupeng},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026}
}
This work was supported in part by the National Natural Science Foundation of China (No. 62206156, No. 62306074, No. 62276155, No. 72004127, No. 62206157); the NSF of Shandong Province (No. ZR2024QF104, No. ZR2021MF040, No. ZR2022QF047); the Key R&D Program of Shandong Province (No. 2022CXGC020107); the Natural Science Basic Research Plan in Shaanxi Province (No. 2025JCJCQN-091); and the Key R&D Program of Shaanxi (No. 2024GX-YBXM-556).
This project is released under the terms of the LICENSE file included in this repository.


