Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time
February 18, 2021 17:03
May 14, 2021 17:55
February 18, 2021 15:41
February 19, 2021 20:50
May 13, 2021 13:09
May 14, 2021 17:47

Download and pre-process ROAD dataset

Here we provide the download and pre-processing instructions for the ROAD dataset, that is released through our TPAMI paper: ROAD: The ROad event Awareness Dataset for Autonomous Driving and uses 3D-RetinaNet code as a baseline, which also contains the evaluation code. The ROAD dataset will be used within The ROAD challenge.

Main Features

  • Action annotations for human as well as other road agents, e.g. Turning-right, Moving-away etc.
  • Agent type labels, e.g. Pedestrian, Car, Cyclist, Large-Vehicle, Emergency-Vehicle etc.
  • Semantic location labels of the location of agent, e.g. in vehicle lane, in right pavement etc.
  • 122K frames from 22 videos annotated, each video is 8 mins long on an average.
  • track/tube id annotated for every bounding box on every frame for every agent in the scene.
  • 7K tubes/tracks of individual agents.
    • Each tube consists on average of approximately 80 bounding boxes linked over time.
  • 559K bounding box-level agent labels.
  • 641K and 498K bounding box-level action and location labels.
  • 122k annotated with self/ego-actions of AV as well, e.g. AV-on-the-mov, AV-Stopped, AV-turning-right, AV-Overtaking etc.


ROAD dataset is build upon Oxford Robot Car Dataset (OxRD). If you find the original dataset useful in your work, please cite it using the citation that can be found here.

Similar to the original dataset (OxRD), the ROAD dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License and is intended for non-commercial academic use. If you are interested in using the dataset for commercial purposes, please contact original creator OxRD for video content and Fabio and Gurkirt for event annotations.

If you use ROAD dataset, please cite it using the following:

@ARTICLE {singh2022road,
author = {Singh, Gurkirt and Akrigg, Stephen and Di Maio, Manuele and Fontana, Valentina and Alitappeh, Reza Javanmard and Saha, Suman and  Jeddisaravi, Kossar and Yousefi, Farzad and Culley, Jacob and Nicholson, Tom and others},
journal = {IEEE Transactions on Pattern Analysis & Machine Intelligence},
title = {ROAD: The ROad event Awareness Dataset for autonomous Driving},
year = {5555},
volume = {},
number = {01},
issn = {1939-3539},
pages = {1-1},
keywords = {roads;autonomous vehicles;task analysis;videos;benchmark testing;decision making;vehicle dynamics},
doi = {10.1109/TPAMI.2022.3150906},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {feb}



We release the annotations created by Visual Artificial Intelligence Laboratory, and the sub-set of pre-processed videos from OxRD. Pre-processing includes demosaic for RGB conversion, ffmpeg for .mp4 conversion and fixing the frame-rate. More details can be found in tar2mp4.

You can download the Train-Val-set videos and corresponding annotations by changing your current directory to the road directory and running the bash file This will automatically download the annotation files and video directory in the current directory (road).


Alternatively, you can download the Train-Val-set videos and annotations from our Google-Drive folder.

The videos of Test-set is released, you can download it from our Google-Drive folder.


The baseline code for 3D-RetinaNet used in the dataset release paper uses sequences of frames as input. Once you have downloaded the videos from Google-Drive, create a folder name road and put the annotations under it, then create another folder named videos under road folder, and put all the videos under the folder named videos. Now, your folder structure should look like this:

        - road_trainval_v1.0.json
        - videos/
            - 2014-06-25-16-45-34_stereo_centre_02
            - 2014-06-26-09-53-12_stereo_centre_02
            - ........

Before extracting the frames, you will need to make sure that you have ffmpeg installed on your machine or your python should include its binaries. If you are using Ubuntu, the following command should be sufficient: sudo apt install ffmpeg.

You can now use to extract the frames. You will need to provide the path to the road folder as an argument:

python <path-to-road-folder>/road/

Now, the road directory should look like this:

        - road_trainval_v1.0.json
        - videos/
            - 2014-06-25-16-45-34_stereo_centre_02
            - 2014-06-26-09-53-12_stereo_centre_02
            - ........
        - rgb-images
            - 2014-06-25-16-45-34_stereo_centre_02/
                - 00001.jpg
                - 00002.jpg
                - .........*.jpg
            - 2014-06-26-09-53-12_stereo_centre_02
                - 00001.jpg
                - 00002.jpg
                - .........*.jpg
            - ......../
                - ........*.jpg

Annotation Structure

The annotations for the train and validation split are saved in single json file named road_trainval_v1.0.json, which is located under root directory of the dataset as it can be seen above.

The first level of road_trainval_v1.0.json contains dataset level information like classes of each label type:

  • Here are all the fields: dict_keys(['all_input_labels', 'all_av_action_labels', 'av_action_labels', 'agent_labels', 'action_labels', 'duplex_labels', 'triplet_labels', 'loc_labels', 'db', 'label_types', 'all_duplex_labels', 'all_triplet_labels', 'all_agent_labels', 'all_loc_labels', 'all_action_labels', 'duplex_childs', 'triplet_childs'])
  • all_input_labels: All classes used to annotate the dataset
  • label_types : It is list of all the label types ['agent', 'action', 'loc', 'duplex', 'triplet'].
  • all_av_action_labels: All classes used to annotate AV actions
  • av_action_labels: Classes finally being used for AV actions
  • Remaining fields ending with labels follows the same logic and AV actions described in above line.
  • duplex_childs and triplet_childs contain ids of child classes form agent, action or location labels to construct duplex or triplet labels,
  • duplex is constructed using agent and action classes.
  • event or triplet is constructed using agent, action, and location classes.

Finally, the db field contains all frame and tube level annotations for all the videos:

  • To access annotation for a vides, use db['2014-06-25-16-45-34_stereo_centre_02'], where '2014-06-25-16-45-34_stereo_centre_02' is name of a video.
  • Each video annotation comes with following fields
    • ['split_ids', 'agent_tubes', 'action_tubes', 'loc_tubes', 'duplex_tubes', 'triplet_tubes', 'av_action_tubes', 'frame_labels', 'frames', 'numf']
    • split_ids contains the split id assigned this videos out of 'test', 'train_1','val_1',........'val_3'.
    • numf is number of frames in the video.
    • frame_labels is AV-action class ids assigned for each frame of the videos.
    • frames contains frame-level annotations
      • for the each frame of the video, e.g. for '1' frame, frames['1'] contains ['annotated', 'rgb_image_id', 'width', 'height', 'av_action_ids', 'annos', 'input_image_id']
      • annotated is flag to indicate if frame is annotated or not.
      • rgb_image_id = input_image_id is id of physical frame extracted by ffmpeg.
      • av_action_ids: AV action labels
      • annos : contains annotations of a frame with bounding boxes with unique keys like annos['4585'] =['b19111',] from a frame number '4585', which is unique in whole dataset.
      • annos['b09'] has following keys dict_keys(['box', 'agent_ids', 'loc_ids', 'action_ids', 'duplex_ids', 'triplet_ids', 'tube_uid'])
        • box is normalized (0,1) bounding box coordinate with xmin, ymin, xmax, ymax
        • tube_uid id of the agent tube it belongs.
        • fields ending with _ids contains class ids of respective label type.
    • The fields ending tubes contains tube-level annotation of respective label-type.
      • for example, db['2014-06-25-16-45-34_stereo_centre_02']['agent_tubes'] contains tubes with fields like ['544e13cc-001-01', 'a074d1bf-001-01', 'e97b3e4c-001-01', 'edb6d66a-005-01', .........]
      • each tube has following fields dict_keys(['label_id', 'annos'])
        • label_id is class id from respective label type.
        • annos is dictinary with keys made of frame_ids, e.g. ['agent_tubes']['10284a58-002-01']['annos'].keys() >> dict_keys(['4585', '4586', ......, '4629', '4630'])
        • annos['4585'] = 'b19111' stores unique key which points to frame-level annotations in frame number '4585'.


Now that you have the dataset and are familiar with its structure, you are ready to train or test 3D-RetinaNet, which contains a dataloader class and evaluation scripts required for all the tasks in ROAD dataset.

You can find the evaluation functions in 3D-RetinaNet/modules/

Plotting annotations

Please note that you need to setup the dataset structure like it is set in the Frame-extraction section! In order to inspect the dataset, you can use to plot the annotations for the videos in road/rgb-images/. This will dump plotted images, then you can use ffmpeg to convert them into a video like shown below: