Official Implementation · AAAI 2026
Haoyu Tang1 · Tianyuan Liang1 · Han Jiang2 · Xuesong Liu3 · Qinghai Zheng4 · Yupeng Hu1 ✉
1 School of Software, Shandong University | 2 Xi'an Jiaotong University | 3 University of Glasgow | 4 Fuzhou University
✉ Corresponding author
Figure: Overview of the CASCADE framework — Context-Guided Action Filtering → Stage-Aware Decomposition → Stage-wise Confidence Estimation → Compositional Action Reconstruction.
- Updates
- Introduction
- Highlights
- Framework Overview
- Project Structure
- Installation
- Dataset
- Usage
- Main Results
- Visualization
- TODO
- Citation
- Acknowledgements
- License
- [04/2026] Initial release of code and paper.
Temporal Action Localization (TAL) aims to identify the category and precise temporal boundaries of actions in untrimmed videos. Current Zero-Shot Temporal Action Localization (ZSTAL) methods—whether training-based or training-free—predominantly rely on a single, unified query to represent an entire action. This strategy is fundamentally ill-suited for complex, real-world activities, as it fails to capture their internal compositional structure and dynamic multi-stage variations across videos.
To address this, we reframe ZSTAL as a compositional reasoning task and propose CASCADE (Context-Aware Staged Action DEcomposition), a novel training-free framework. Inspired by the human cognitive process—perceiving global context, decomposing events into stages, and reconstructing action instances—CASCADE:
- Leverages an MLLM to filter irrelevant actions and generate rich, video-specific captions.
- Uses an LLM to decompose each caption into temporally ordered key and non-key stages.
- Employs the MLLM to perform stage-wise frame-level confidence estimation.
- Applies a novel hierarchical merging logic to reconstruct complete action instances from stage segments.
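The four steps above can be sketched as a single pipeline. This is an illustrative sketch only: the callables passed in (`filter_fn`, `caption_fn`, `decompose_fn`, `score_fn`, `merge_fn`) stand in for the MLLM/LLM calls, and all names are hypothetical, not the repository's actual API.

```python
def cascade(frames, vocabulary, filter_fn, caption_fn, decompose_fn,
            score_fn, merge_fn):
    """Illustrative sketch of the CASCADE pipeline (training-free)."""
    # 1. Context-Guided Action Filtering: keep only actions the MLLM
    #    believes occur in this video.
    candidates = [a for a in vocabulary if filter_fn(frames, a)]

    instances = []
    for action in candidates:
        # 2. Stage-Aware Decomposition: video-specific caption -> ordered stages.
        caption = caption_fn(frames, action)
        stages = decompose_fn(caption)  # temporally ordered key/non-key stages

        # 3. Stage-wise Confidence Estimation: one score per (frame, stage).
        scores = [[score_fn(f, s) for f in frames] for s in stages]

        # 4. Compositional Action Reconstruction: fuse stage segments
        #    into complete action instances.
        instances.extend(merge_fn(action, stages, scores))
    return instances
```

Each module is independent of the others' internals, which is what makes the framework plug-and-play with different MLLM backbones.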
Extensive experiments on THUMOS14 and ActivityNet-1.3 show that CASCADE sets a new state-of-the-art among training-free methods and, most notably, significantly surpasses all prior training-based ZSTAL approaches on ActivityNet-1.3.
- 🏆 Training-Free SOTA: CASCADE outperforms all prior training-free ZSTAL methods on both THUMOS14 and ActivityNet-1.3.
- 🚀 Surpasses Training-Based Methods: On ActivityNet-1.3, our training-free approach outperforms all training-based ZSTAL competitors (e.g., +7.0% mAP over DeTAL under the 75/25 split).
- 🧠 Compositional Reasoning: The first ZSTAL framework to explicitly model the internal stage structure of complex actions via LLM-driven decomposition.
- 🔌 Plug-and-Play: Operates solely with off-the-shelf MLLMs (e.g., Qwen2.5-VL, LLaVA-1.5) and LLMs (e.g., DeepSeek-V3)—no task-specific fine-tuning required.
CASCADE consists of four sequential modules:
| Step | Module | Description |
|---|---|---|
| 1 | Context-Guided Action Filtering | MLLM identifies which actions from the predefined set actually occur in the video. |
| 2 | Stage-Aware Decomposition | MLLM generates a video-specific caption; LLM decomposes it into ordered key/non-key stages. |
| 3 | Stage-wise Confidence Estimation | MLLM computes frame-level confidence scores for each stage in a single batched forward pass. |
| 4 | Compositional Action Reconstruction | A hierarchical merging logic fuses stage segments into complete, coherent action instances. |
Figure: Illustration of stage-aware localization. Decomposing "Baking cookies" into sub-stages (cutting chocolate → mixing ingredients → baking) yields significantly more precise localization than a single-query approach.
AAAI26-CASCADE/
├── annotation/ # Annotation files for datasets
│ ├── activity_net.v1-3.min.json # ActivityNet-1.3 annotations
│ └── thumos_anno_action.json # THUMOS14 annotations
├── code/ # Core implementation of CASCADE
│ ├── 1category.py # Step 1: Context-Guided Action Filtering
│ ├── 2caption.py # Step 2: Video-specific caption generation
│ ├── 3stage.py # Step 3: Stage-Aware Decomposition (LLM)
│ ├── 4localization.py # Step 4: Stage-wise Confidence Estimation
│ ├── 5merge.py # Step 5: Compositional Action Reconstruction
│ └── 6value.py # Step 6: Evaluation & metrics
├── paper/
│ ├── 29083.pdf # Paper PDF
│ ├── framework.png # Framework overview figure
│ ├── scores.png # Performance score visualization
│ └── video-stage.png # Video stage decomposition illustration
├── LICENSE
└── README.md
- Python >= 3.8
- PyTorch >= 2.0
- CUDA-compatible GPU (experiments run on NVIDIA A100 80GB)
# Clone the repository
git clone https://github.com/iLearn-Lab/AAAI26-CASCADE.git
cd AAAI26-CASCADE
# Create a virtual environment (recommended)
conda create -n cascade python=3.10 -y
conda activate cascade
# Install dependencies
pip install -r requirements.txt
CASCADE is training-free and relies on the following off-the-shelf pretrained models:
| Role | Model | Source |
|---|---|---|
| Backbone MLLM (option 1) | Qwen2.5-VL-7B | HuggingFace |
| Backbone MLLM (option 2) | LLaVA-1.5-7B | HuggingFace |
| Stage Decomposition LLM | DeepSeek-V3 | DeepSeek API |
CASCADE is evaluated on two standard ZSTAL benchmarks:
THUMOS14
- 20 sports action classes; 200 validation videos (conventionally used for training) and 213 test videos.
- Evaluation at tIoU thresholds: {0.3, 0.4, 0.5, 0.6, 0.7}.
ActivityNet-1.3
- 200 action classes, ~20K videos across train/val/test splits.
- Evaluation at tIoU thresholds: {0.5, 0.75, 0.95}.
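A prediction counts as correct at a given threshold when its temporal IoU with a ground-truth segment meets that threshold. A minimal tIoU implementation (segments as `(start, end)` in seconds):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction covering the second half of a 10-second ground-truth action scores tIoU 0.5: a hit at threshold 0.5 but a miss at 0.75.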
Both datasets are evaluated under 75%/25% and 50%/50% seen/unseen class splits, averaged over 10 random splits for statistical robustness.
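The split protocol can be sketched as below. This is a simplified illustration under the stated assumptions (uniform random class shuffling with a fixed seed); the exact per-benchmark protocol follows prior ZSTAL work.

```python
import random

def make_splits(classes, seen_ratio=0.75, n_splits=10, seed=0):
    """Generate random seen/unseen class splits (e.g. 75%/25%) for averaging."""
    rng = random.Random(seed)
    n_seen = round(len(classes) * seen_ratio)
    splits = []
    for _ in range(n_splits):
        shuffled = list(classes)
        rng.shuffle(shuffled)
        splits.append((sorted(shuffled[:n_seen]), sorted(shuffled[n_seen:])))
    return splits
```

Reported numbers are the mean over the resulting `n_splits` seen/unseen partitions, which reduces the variance introduced by any single class split.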
Please refer to the official dataset pages for download instructions.
# Step 1 — Context-Guided Action Filtering
python code/1category.py --dataset activitynet --split 75_25
# Step 2 — Video-specific Caption Generation
python code/2caption.py --dataset activitynet --split 75_25
# Step 3 — Stage-Aware Decomposition (LLM)
python code/3stage.py --dataset activitynet --split 75_25
# Step 4 — Stage-wise Confidence Estimation
python code/4localization.py --dataset activitynet --backbone qwen --split 75_25
# Step 5 — Compositional Action Reconstruction
python code/5merge.py --dataset activitynet --split 75_25
# Step 6 — Evaluation
python code/6value.py --dataset activitynet --split 75_25
⚠️ Note: Detailed scripts and configuration files will be released shortly. Please check back or watch the repository for updates.
| Method | Training | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | mAP |
|---|---|---|---|---|---|---|---|
| Eff-Prompt | ✓ | 39.7 | 31.6 | 23.0 | 14.9 | 7.5 | 23.3 |
| STALE | ✓ | 40.5 | 32.3 | 23.5 | 15.3 | 7.6 | 23.8 |
| DeTAL | ✓ | 39.8 | 33.6 | 25.9 | 17.4 | 9.9 | 25.3 |
| T3AL | ✗ | 19.2 | 12.7 | 7.4 | 4.4 | 2.2 | 9.2 |
| FreeZAD | ✗ | 21.2 | 13.6 | 8.3 | 4.7 | 2.5 | 10.0 |
| ZEAL | ✗ | 22.1 | 16.1 | 11.0 | 5.7 | 3.0 | 11.6 |
| CASCADE-Qwen (Ours) | ✗ | 23.9 | 17.5 | 11.7 | 7.6 | 4.3 | 13.0 |
| CASCADE-LLaVA (Ours) | ✗ | 23.8 | 17.9 | 14.0 | 7.6 | 5.1 | 13.7 |
Results under the 75%/25% split. Training-free methods in bold.
| Method | Training | 0.5 | 0.75 | 0.95 | mAP |
|---|---|---|---|---|---|
| DeTAL | ✓ | 39.3 | 26.4 | 5.0 | 25.8 |
| FreeZAD | ✗ | 33.5 | 17.5 | 3.9 | 18.3 |
| CASCADE-Qwen (Ours) | ✗ | 41.4 | 27.4 | 7.1 | 25.3 |
| CASCADE-LLaVA (Ours) | ✗ | 52.7 | 36.7 | 8.3 | 32.6 |
Results under the 75%/25% split. CASCADE-LLaVA surpasses ALL training-based competitors.
The figure below shows CASCADE's localization process for the action "Baking cookies" on ActivityNet-1.3. The action is decomposed into four semantic stages (S1: Preparing Ingredients, S2: Mixing Ingredients, S3: Melting Chocolate, S4: Baking). Frame-level confidence scores are computed per stage, thresholded to yield raw proposals (P), and then fused via hierarchical merging into a final prediction (P̂) that closely matches the ground truth (GT).
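The threshold-then-merge step described above can be sketched as follows. This is a deliberately simplified stand-in for the paper's hierarchical merging logic (function names, the fixed threshold, and the `gap` heuristic are illustrative assumptions):

```python
def segments_from_scores(scores, threshold=0.5):
    """Frame-level confidences -> contiguous above-threshold segments (P)."""
    segs, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                       # segment opens
        elif s < threshold and start is not None:
            segs.append((start, i - 1))     # segment closes
            start = None
    if start is not None:
        segs.append((start, len(scores) - 1))
    return segs

def merge_close(segs, gap=1):
    """Fuse segments separated by at most `gap` frames -- a simplified
    stand-in for CASCADE's hierarchical merging into final predictions."""
    merged = []
    for s, e in sorted(segs):
        if merged and s - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

In CASCADE the merging is stage-aware (key stages anchor the instance, non-key stages bridge gaps), which this frame-gap heuristic only approximates.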
- Release full inference code
- Release detailed configuration files and prompts
- Release annotation files
- Add demo / visualization script
- Release pre-computed results
If you find CASCADE useful in your research, please consider citing our paper:
@inproceedings{tang2026cascade,
title = {Decompose and Conquer: Compositional Reasoning for Zero-Shot Temporal Action Localization},
author = {Tang, Haoyu and Liang, Tianyuan and Jiang, Han and Liu, Xuesong and Zheng, Qinghai and Hu, Yupeng},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026}
}
This work was supported in part by the National Natural Science Foundation of China (No. 62206156, No. 62306074, No. 62276155, No. 72004127, No. 62206157); the NSF of Shandong Province (No. ZR2024QF104, No. ZR2021MF040, No. ZR2022QF047); the Key R&D Program of Shandong Province (No. 2022CXGC020107); the Natural Science Basic Research Plan in Shaanxi Province (No. 2025JCJCQN-091); and the Key R&D Program of Shaanxi (No. 2024GX-YBXM-556).
This project is released under the terms of the LICENSE file included in this repository.


