ROCK: Retrieval Over Collected Knowledge

ROCK is a model for Knowledge-Based Visual Question Answering in Videos. It is the first model to incorporate external knowledge when answering questions about video clips. ROCK assumes the availability of language instances representing the knowledge in a certain universe (e.g. a TV show), from which it constructs a knowledge base (KB). The model retrieves instances from the KB and fuses them with language and spatio-temporal video representations for reasoning and answer prediction.


Check the project website here.


  1. Clone the repository:

    git clone

  2. Download the KnowIT VQA dataset and save the csv files in Data/.

  3. Install dependencies:

    • Python 3.6
    • numpy (conda install -c anaconda numpy)
    • pandas (conda install -c anaconda pandas)
    • sklearn (conda install -c anaconda scikit-learn)
    • visdom (conda install -c conda-forge visdom)
    • pytorch 0.4.1 (conda install pytorch=0.4.1 cuda90 -c pytorch)
    • torchvision (conda install torchvision)
    • pytorch-pretrained-bert 0.4.0 (conda install -c conda-forge pytorch-pretrained-bert=0.4.0)

Note: Make sure to install pytorch-pretrained-bert, not its newer replacement pytorch-transformers.

ROCK Model

ROCK addresses KBVQA as a multiple-choice problem, in which each question is associated with several candidate answers, only one of which is correct. The model has three main modules:

  1. Knowledge Base Construction: creates a knowledge base using the samples from the dataset.

  2. Knowledge Retrieval: accesses the knowledge base and retrieves the most relevant instance for a given question and its candidate answers.

  3. Video Reasoning: uses the information from the video, subtitles, and retrieved knowledge to predict the correct answer.
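The three modules above can be sketched end-to-end with a toy Python stand-in. This is purely illustrative: `retrieve` and `score` use word overlap in place of the actual BERT-based retrieval and reasoning models, and all function names here are hypothetical.

```python
# Toy sketch of the ROCK pipeline: retrieve knowledge, then score each
# candidate answer against the fused context. Word overlap stands in for
# the real BERT-based models.

def retrieve(question, knowledge_base):
    """Return the KB instance sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(knowledge_base,
               key=lambda inst: len(q_words & set(inst.lower().split())))

def score(question, answer, subtitles, knowledge):
    """Toy score: word overlap between the answer and the fused context."""
    context = set((question + " " + subtitles + " " + knowledge).lower().split())
    return len(context & set(answer.lower().split()))

def predict(question, candidates, subtitles, knowledge_base):
    knowledge = retrieve(question, knowledge_base)        # knowledge retrieval
    scores = [score(question, a, subtitles, knowledge)    # video reasoning
              for a in candidates]
    return candidates[scores.index(max(scores))]          # best-scoring answer
```

In the real model, the scoring step fuses the retrieved knowledge with spatio-temporal video features rather than plain text overlap.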

Knowledge Base

To create the knowledge base:

sh KnowledgeBase/

The files reason_idx_to_kb*.pckl and reason_kb_dict.pckl containing the instances of the knowledge base are saved in Data/KB/.
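The .pckl files are standard Python pickles. A self-contained round-trip demo of that storage format is shown below; the toy dictionary layout is an assumption for illustration, so inspect the real reason_kb_dict.pckl to confirm its structure.

```python
import os
import pickle
import tempfile

# Round-trip demo of the pickle format used for the KB files in Data/KB/.
# The dictionary below is a toy example, not the real KB layout.
toy_kb = {0: "Sheldon and Leonard are roommates.",
          1: "Penny lives across the hall."}

path = os.path.join(tempfile.mkdtemp(), "reason_kb_dict.pckl")
with open(path, "wb") as f:
    pickle.dump(toy_kb, f)          # save the KB dictionary

with open(path, "rb") as f:
    loaded = pickle.load(f)         # load it back

assert loaded == toy_kb
```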

Knowledge Retrieval

To train the knowledge retrieval module:

sh KnowledgeRetrieval/

The BertScoring model is saved in Training/KnowledgeRetrieval/.

Note: The matching scores for test and validation sets take a long time to compute. You can download our pre-computed scores from here and save them in Data/.
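The matching scores pair a question with every instance in the KB, which is why they are expensive to compute. The ranking idea can be sketched with a toy scorer, where Jaccard word overlap stands in for the BertScoring model (both function names here are illustrative):

```python
def match_score(query, instance):
    """Toy stand-in for BertScoring: Jaccard word-overlap score."""
    q = set(query.lower().split())
    i = set(instance.lower().split())
    return len(q & i) / max(len(q | i), 1)

def rank_kb(query, kb):
    """Score every KB instance and return (index, score) pairs, best first."""
    scores = [(idx, match_score(query, inst)) for idx, inst in enumerate(kb)]
    return sorted(scores, key=lambda t: t[1], reverse=True)
```

The real module instead fine-tunes BERT to score question-answer-instance triples, but the caching motivation is the same: every query must be scored against the whole KB.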

Video Reasoning

We propose four models, each using different visual features extracted from the video clips: ROCK-image, ROCK-concepts, ROCK-facial, and ROCK-caption.

Data preparation
  1. Download the video frames and save them in the Data/Frames/ directory.

  2. Compute language embeddings: python VideoReasoning/

  3. (For ROCK-concepts only) Download the pre-computed visual concepts (77.2GB) from the server (apas:/home/noagarcia/ROCK-concepts) and save the file in Data/Concepts/. Visual concepts were generated with this code.

  4. (For ROCK-facial only) Download the pre-computed list of faces per frame from here (240.3MB) and save the file in Data/Faces/. Character faces were recognized with this code.

  5. (For ROCK-caption only) Download the pre-computed captions per frame from here (21.1MB) and save the file in Data/Captions/. Captions were generated with this code.

Model training and evaluation
  • For ROCK-image:
python VideoReasoning/ --vision image
  • For ROCK-concepts:
python VideoReasoning/ --vision concepts
  • For ROCK-facial:
python VideoReasoning/ --vision facial
  • For ROCK-caption:
python VideoReasoning/ --use_captions
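The `--vision` and `--use_captions` flags select which visual stream the reasoning module consumes. A minimal argparse sketch of that dispatch is below; the flag names mirror the commands above, but the parser itself is illustrative rather than the repository's actual CLI.

```python
import argparse

# Illustrative CLI matching the training commands above; not the repo's
# actual argument parser.
def build_parser():
    p = argparse.ArgumentParser(description="ROCK video-reasoning training")
    p.add_argument("--vision", choices=["image", "concepts", "facial"],
                   help="visual features to fuse (ROCK-image/-concepts/-facial)")
    p.add_argument("--use_captions", action="store_true",
                   help="use frame captions instead (ROCK-caption)")
    return p

args = build_parser().parse_args(["--vision", "concepts"])
print(args.vision)  # concepts
```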
Pretrained weights

Our pretrained models are available to download from:

  • BertReasoning from here. Save the file in Training/VideoReasoning/BertReasoning_topk5_maxseq256.

  • ROCK-image from here. Save the file in Training/VideoReasoning/AnswerPrediction_image.

  • ROCK-concepts from here. Save the file in Training/VideoReasoning/AnswerPrediction_concepts.

  • ROCK-facial from here. Save the file in Training/VideoReasoning/AnswerPrediction_facial.

  • ROCK-caption from here. Save the file in Training/VideoReasoning/AnswerPrediction_caption.


Accuracy on the KnowIT VQA dataset, split by question type:

| Model | Visual | Textual | Temporal | Knowledge | All |
|---------------|-------|-------|-------|-------|-------|
| ROCK-image    | 0.658 | 0.703 | 0.628 | 0.644 | 0.654 |
| ROCK-concepts | 0.658 | 0.703 | 0.628 | 0.645 | 0.654 |
| ROCK-facial   | 0.658 | 0.703 | 0.628 | 0.644 | 0.654 |
| ROCK-caption  | 0.639 | 0.674 | 0.605 | 0.628 | 0.635 |


If you find this code useful, please cite our work:

   @inproceedings{garcia2020knowit,
     author    = {Noa Garcia and Mayu Otani and Chenhui Chu and Yuta Nakashima},
     title     = {KnowIT VQA: Answering Knowledge-Based Questions about Videos},
     booktitle = {Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
     year      = {2020},
   }