This repository contains the code for our AAAI 2017 paper, "Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters"


Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

This repository contains the code for our AAAI 2017 paper:

AJ Piergiovanni, Chenyou Fan and Michael S Ryoo
"Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters"
in Proc. AAAI 2017

If you find the code useful for your research, please cite our paper:

        title={Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters},
        author={Piergiovanni, AJ and Fan, Chenyou and Ryoo, Michael S},
        booktitle={Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence},

Temporal Attention Filters

The core of our approach, the temporal attention filters, is implemented in a single file that contains all the code to create and apply the attention filters. We provide several of the models we tested: one applies max, mean, or sum pooling over the input features; one applies the unlearned pyramid of filters; and one dynamically adjusts the filters with an LSTM. Our best-performing model learns a set of attention filters for each activity class.
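As a rough illustration of what a temporal attention filter does, the sketch below builds a small bank of Gaussian-shaped filters over the time axis and uses them to pool per-frame features. It is written in plain NumPy rather than Theano and is not the code from this repository: the function names, the Gaussian parameterization (center, stride, and width given directly in frame units), and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def temporal_attention_filter(center, stride, width, num_filters, num_frames):
    """Build a (num_filters x num_frames) bank of Gaussian temporal filters.

    center, stride and width are hypothetical stand-ins for learned
    parameters, expressed here directly in frame units.
    """
    # Evenly spaced filter centers, symmetric around `center`.
    offsets = np.arange(num_filters) - (num_filters - 1) / 2.0
    mu = center + offsets * stride                       # shape (N,)
    t = np.arange(num_frames)                            # shape (T,)
    weights = np.exp(-((t[None, :] - mu[:, None]) ** 2) / (2.0 * width ** 2))
    # Normalize each filter over time (epsilon guards against an all-zero row).
    weights /= weights.sum(axis=1, keepdims=True) + 1e-8
    return weights


def apply_filter(filters, features):
    """Pool per-frame features (T x D) into N weighted summaries (N x D)."""
    return filters @ features


rng = np.random.default_rng(0)
features = rng.standard_normal((60, 8))   # 60 frames, 8-dim stand-in features
filters = temporal_attention_filter(center=22.0, stride=4.0, width=2.0,
                                    num_filters=3, num_frames=60)
pooled = apply_filter(filters, features)
print(filters.shape, pooled.shape)
```

Each row of the filter bank is a soft window over the frames, so the pooled output summarizes the features inside one temporal interval. In a learned model the center, stride, and width would be trained along with the classifier instead of being fixed by hand as they are here.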

Activity Classification Experiments

This code is for the activity classification task. We are able to learn latent sub-events using only activity labels; no labels for the sub-events are given. Our model extracts per-frame CNN features and learns a set of temporal attention filters on those features. Each filter corresponds to a unique sub-event, and our code takes advantage of these sub-events to improve activity recognition performance.

We tested our models on both the DogCentric and HMDB datasets.

Example Learned Sub-events

We trained our model on the HMDB dataset, which contains ~7000 videos of 51 different human activities. Our model learned 3 sub-events for each activity. Here are some of the sub-events our model learned.

For the pushup activity, our model learned two key sub-events: the "moving down" sub-event (frames 18 to 26) and the "pushing up" sub-event (frames 48 to 55).


The somersault activity is a more complex action in which a person rotates over their feet.

Our model learned 3 sub-events. Two focus on the intervals where the person is upside-down (frames 31 to 51 and frames 40 to 48).


The third sub-event focuses on the person standing up after completing the flip (frames 54 to 60).



Our code has been tested on Ubuntu 14.04 and 16.04 using Theano version 0.9.0dev4 (but will likely work with other versions) with a Titan X GPU. We also rely on Fuel to help with the HDF5 datasets.


  1. Download the code with git clone.

  2. Extract features from your dataset. These can be any per-frame features, such as VGG, per-segment features, such as C3D applied over many segments, or ITF features. We relied on pretrained models for this step.

  3. Once the features have been extracted, we provide an example script that creates an HDF5 file for training the models and shows how we did this. The way we store and load the features allows us to properly apply masking to videos of different lengths in the same batch.

  4. We provide several example scripts to train the models. One trains the models with attention filters shared across all classes; this script also outputs the performance of the model and saves the final model. Another trains a binary classifier for each activity, each of which learns its own set of attention filters.

  5. Once all the binary classifiers have been trained, an example script loads the trained binary classifiers and runs them on the test set. The results are saved, and the script combines them to create the final predictions.
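Steps 3 and 5 above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the repository's scripts, which use Theano, Fuel, and HDF5 files: the helper names are hypothetical, the features are toy arrays, and combining the binary classifiers by taking the highest-scoring class per video is an assumed rule that the text above does not spell out.

```python
import numpy as np


def pad_batch(videos):
    """Zero-pad a list of (T_i x D) feature arrays to a common length.

    Returns the padded batch (B x T_max x D) and a 0/1 mask (B x T_max)
    marking which frames are real rather than padding.
    """
    max_len = max(v.shape[0] for v in videos)
    dim = videos[0].shape[1]
    batch = np.zeros((len(videos), max_len, dim), dtype=np.float32)
    mask = np.zeros((len(videos), max_len), dtype=np.float32)
    for i, v in enumerate(videos):
        batch[i, :v.shape[0]] = v
        mask[i, :v.shape[0]] = 1.0
    return batch, mask


def masked_mean(batch, mask):
    """Mean-pool over time while ignoring padded frames."""
    summed = (batch * mask[:, :, None]).sum(axis=1)
    return summed / mask.sum(axis=1, keepdims=True)


def combine_binary_scores(scores):
    """Pick, per video, the class whose binary classifier scored highest.

    scores has shape (num_classes, num_videos). Argmax is an assumed
    combination rule, chosen here for simplicity.
    """
    return np.argmax(scores, axis=0)


# Two toy "videos" of different lengths with 4-dim features.
videos = [np.ones((5, 4), dtype=np.float32),
          2 * np.ones((3, 4), dtype=np.float32)]
batch, mask = pad_batch(videos)
pooled = masked_mean(batch, mask)

# Made-up scores from 3 binary classifiers on 2 test videos.
scores = np.array([[0.2, 0.9], [0.7, 0.1], [0.1, 0.3]])
predictions = combine_binary_scores(scores)
print(batch.shape, mask.sum(axis=1), predictions)
```

The mask is what keeps the zero padding of the shorter video from diluting its pooled features; the same idea applies when the pooling is a learned attention filter instead of a plain mean.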