
Multimodal Learning and Reasoning for Visual Question Answering

This repository contains the dissertation, presentation slides, project plan, project brief and project poster for my MSc Project at the University of Southampton. The dissertation also provides an extensive review of various deep learning architectures, including CNN, RNN, LSTM, attention mechanisms, visual attention, self-attention, the Transformer, BERT and state-of-the-art visual question answering models.

(Image of the model)

Abstract

Current deep learning systems are very successful at sensory perception and pattern recognition (e.g. object detection and speech recognition). However, they often struggle with tasks of a compositional nature that require more deliberate thinking and multi-step reasoning. In this work, we study how to improve the reasoning capability of artificial neural networks in the context of visual question answering (VQA) and visual reasoning. These two tasks are challenging multimodal research problems that require fine-grained understanding of both the image and the question, and typically demand the ability to perform multi-step inference. As part of this work, we develop a new VQA model based on the novel Transformer architecture, utilising both self-attention and co-attention mechanisms to deal with multimodal input. Experimental results demonstrate that a Transformer-based model can be effective at visual reasoning: our model achieves strong results on both the CLEVR and GQA datasets, with 98.3% and 56.28% accuracy respectively. Our analysis shows that the model learns to look around the image and iteratively focus on the parts of the image that are relevant to finding an answer, indicating that it is capable of performing multi-step reasoning.
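
To make the self-attention/co-attention idea concrete, the sketch below shows a minimal PyTorch block in which image regions first attend to each other and then attend to the question tokens. The module name (`CoAttentionBlock`), layer sizes and exact wiring are illustrative assumptions, not the architecture implemented in the dissertation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Illustrative Transformer-style block: self-attention over image
    regions followed by question-guided co-attention.
    (Assumed layout, not the dissertation's exact model.)"""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.co_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, img, ques):
        # Self-attention: image regions attend to each other.
        x, _ = self.self_attn(img, img, img)
        img = self.norm1(img + x)
        # Co-attention: image regions attend to the question tokens.
        x, attn = self.co_attn(img, ques, ques)
        img = self.norm2(img + x)
        # Position-wise feed-forward network.
        img = self.norm3(img + self.ffn(img))
        return img, attn  # attn weights can be inspected for analysis

# Toy usage: 36 region features and 14 question-token embeddings.
block = CoAttentionBlock()
img = torch.randn(2, 36, 512)
ques = torch.randn(2, 14, 512)
out, attn = block(img, ques)
print(out.shape, attn.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 36, 14])
```

Stacking several such blocks is what enables the iterative, multi-step attention behaviour analysed later.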

Code

Implementations of the baseline models and the proposed VQA model, as well as code for the experiments, can be found here and here. The models were implemented in Python using the PyTorch deep learning library. I also used two frameworks, MMF and OpenVQA, to help with the development of the VQA model.

Analysis

An example visualisation of the model's reasoning process is shown below. For more details, please see the analysis section of the dissertation.

(Visualisation of the model's reasoning process)
