Neural module networks
In traditional NLP approaches, question answering is generally formulated as a two-step process: first, we map from language to some kind of structured representation of meaning (e.g. a database query or logical form); then, we execute that representation against a knowledge base to produce an answer.
Work on question answering with neural nets has mostly focused on image-based QA. (There's some work on answering non-compositional factoid questions about databases, but it doesn't seem to be competitive with classical approaches.) Unsurprisingly, most neural QA so far has a lot less structure than "NLP-style" QA---we have some kind of recurrent sequence model slurp up an embedded representation of the image and question text, then immediately start decoding an answer.
While neural nets are essential for the vision tasks underlying QA, it seems likely that a little "NLP-style" structure would be valuable. Consider answering questions like "what is left of the mug?", "what is above the thing that is left of the mug?", "what is left of the thing that is above the thing that is left of the mug?", etc. Suppose it takes two layers in a neural network to compute a general spatial relationship like "left of" or "left of the thing that is above". Then in a feedforward architecture, either (1) some pair of layers in the network must be capable of computing all generalized spatial relationships, or (2) the network may need to be arbitrarily deep, and each place the "left of" relation gets computed will have to be learned separately. The situation is potentially even worse in an RNN that goes straight from input text to output text: with "what is left of the mug", the network doesn't even know where to start its glance until the last word in the sentence, at which point there is only a single timestep in which to produce the answer.
[N.B. Obviously in this paragraph I'm making pretty strong assumptions about the structure of the computations performed by a neural QA system, but I think these are consistent with the intuitions we have about how convnets and RNNs work. The RNN problem mentioned above can be fixed somewhat if we allow the recurrence to run for multiple timesteps before answering, but there's still the problem of using a fully-connected sequence model to perform a heterogeneous set of vision tasks, which I haven't seen anyone do before.]
What does an NLP-style approach to the visual question answering problem look like? Again we're going to predict a structured query representation---something like name(above(left_of(mug)))---but unlike with databases, both the parsing and execution steps are hard. So we need one model that turns "what is above the thing that's left of the mug" into name(above(left_of(mug))), and another model that evaluates name(above(left_of(mug))) against an image.
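As a purely illustrative picture of what the first model might emit, the structured query can be written down as a nested tuple; the names and representation below are placeholders, not a committed grammar.

```python
# Hypothetical structured-query representation: nested (predicate, argument)
# tuples produced by the parser and consumed by the executor.
query = ("name", ("above", ("left_of", ("mug",))))

def pretty(q):
    """Render a nested-tuple query back into name(above(left_of(mug))) form."""
    head, *args = q
    if not args:
        return head
    return "%s(%s)" % (head, ", ".join(pretty(a) for a in args))

assert pretty(query) == "name(above(left_of(mug)))"
```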
The first of these tasks can be solved with standard structured prediction machinery. For the second, I propose to use the query as a blueprint for dynamically assembling a collection of neural "modules" into a network that maps from images to answers.
Intuitively, we're going to separately instantiate a mug detection network, a "left-of" detection network, and a classification network. When given a structure like name(left_of(mug)), we'll feed the input image into the mug detection network, take the output of that network and feed it as input to the left-of network, and so on. When we see name(left_of(left_of(mug))), we wind up with two identical copies of the left-of network, one feeding its output into the other. Call the whole composed thing a "neural module network".
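Here's a minimal sketch of that assembly step, assuming PyTorch, unary predicates, and a dictionary of module instances keyed by predicate name; every module, size, and name below is illustrative. Because the same module object is reused wherever its predicate appears, name(left_of(left_of(mug))) really does apply one set of left_of parameters twice.

```python
import torch
import torch.nn as nn

class NMN(nn.Module):
    """Assemble a network on the fly from a nested-tuple query.

    `modules` maps predicate names to nn.Module instances; reusing the
    same instance for a repeated predicate ties its parameters."""
    def __init__(self, modules, query):
        super().__init__()
        self.modules_by_name = nn.ModuleDict(modules)
        self.query = query

    def forward(self, image_features):
        def evaluate(q):
            head, *args = q
            module = self.modules_by_name[head]
            if not args:                 # leaf, e.g. the mug detector
                return module(image_features)
            (arg,) = args                # unary predicates only in this sketch
            return module(evaluate(arg))
        return evaluate(self.query)

# Toy stand-in modules; real ones would be detection networks, etc.
D = 64
modules = {
    "mug":     nn.Linear(512, D),   # image features -> "entity" message
    "left_of": nn.Linear(D, D),     # preserves size, so it can be nested
    "above":   nn.Linear(D, D),
    "name":    nn.Linear(D, 1000),  # "entity" message -> answer scores
}
net = NMN(modules, ("name", ("above", ("left_of", ("mug",)))))
answer_scores = net(torch.randn(1, 512))
```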
Formally, we'll assume a fixed list of functions, each with an input type and an output type. We can informally call these types "entities", "integers", "truth values" or whatever; in practice all messages between networks will be vectors. All we need to enforce is that if a query type-checks, every output-to-input connection between modules involves a vector of the correct size. In the running example, for instance, the left-of module has to produce an output the same size as its input for left_of(left_of(...)) to be well-formed.
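One way to make that requirement concrete (again just a sketch, with made-up signatures): attach an input and output size to each predicate and check every connection before instantiating anything.

```python
# Hypothetical type signatures: (input_size, output_size) per predicate.
# None marks a leaf module that reads the raw image features directly.
SIGNATURES = {
    "mug":     (None, 64),
    "left_of": (64, 64),    # same in/out size, so it composes with itself
    "above":   (64, 64),
    "name":    (64, 1000),
}

def output_size(q):
    """Return the output size of a query, or raise if it doesn't type-check."""
    head, *args = q
    in_size, out_size = SIGNATURES[head]
    if args:
        (arg,) = args
        if output_size(arg) != in_size:
            raise TypeError("size mismatch under %s" % head)
    return out_size

output_size(("name", ("above", ("left_of", ("mug",)))))  # -> 1000
```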
Given a training set with a bunch of queries, their inputs, and their outputs, we can instantiate each observed configuration of modules, tie parameters together across model instances, and train with backprop. With appropriate training data (and not-too-overfit modules), we expect to get correct answers out of novel queries---using a fully assembled neural network that was never used during training.
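Parameter tying falls out of object identity in the sketch above: every assembled NMN wraps the same underlying module instances, so a single optimizer over their parameters accumulates gradients across all observed configurations. A rough training-step sketch, reusing the hypothetical `modules` and `NMN` from before:

```python
import itertools
import torch
import torch.nn as nn

# One optimizer over the union of all module parameters.
all_params = itertools.chain(*(m.parameters() for m in modules.values()))
optimizer = torch.optim.SGD(all_params, lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(query, image_features, answer_index):
    net = NMN(modules, query)        # assemble the network for this example
    scores = net(image_features)
    loss = loss_fn(scores, answer_index)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. train_step(("name", ("left_of", ("mug",))),
#                 torch.randn(1, 512), torch.tensor([42]))
```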
Other advantages:
- We can specify some layers by hand to guide the modules toward useful semantics for the inter-module messages. If we want "entity" messages to look like pixel masks over the input image, it might help to define intersection and union ("how many cats or dogs?") as elementwise min and max respectively (see the sketch after this list).
- We can have tremendous variation in the kind of computation performed by each module. Low-level layers (e.g. mapping from the input image to "entity" messages) are probably full object detection networks. Intermediate things like "next-to" might be small, fully-connected networks. For certain kinds of numerical operations (e.g. counting) we might even want to instantiate little LSTMs in place and run them until they decide they're done.
- We can individually pre-train some modules in isolation. This is especially exciting in a scenario like the following: generate both real and synthetic data and train a single NMN to handle both properly for some classification task. Now generate a bunch of synthetic data for a counting task, learn a counting module, and immediately generalize the ability to count to real images.
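For the hand-specified layers mentioned in the first bullet, intersection and union could literally be fixed, parameter-free modules. A sketch, assuming "entity" messages are soft masks with values in [0, 1]:

```python
import torch
import torch.nn as nn

class Intersect(nn.Module):
    """Parameter-free 'and' over mask-like entity messages: elementwise min."""
    def forward(self, mask_a, mask_b):
        return torch.min(mask_a, mask_b)

class Union(nn.Module):
    """Parameter-free 'or' over mask-like entity messages: elementwise max."""
    def forward(self, mask_a, mask_b):
        return torch.max(mask_a, mask_b)

# "how many cats or dogs?" -> count(union(cat(image), dog(image)))
cats, dogs = torch.rand(1, 14, 14), torch.rand(1, 14, 14)
cats_or_dogs = Union()(cats, dogs)
```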
As described, an NMN is a particular kind of recurrent network with (1) a tree-shaped recurrence structure, (2) an unusually heavy computation at each step, and (3) different computations at different steps. In this respect, NMNs are close relatives of recurrent CNNs [Pinheiro 14] and recursive neural networks [Socher variously]. On the whole, this actually seems like a relatively small step from things that are going on in structured-RNN-land already---the main novelty is just using a different model to predict structures on the fly.
Viewed another way, the whole NMN project is just a particularly ambitious parameter tying scheme across a large collection of networks, each trained on relatively few examples. (This view underplays the importance of being able to get sensible outputs from novel module configurations.)
If we're building these queries from natural language expressions, we expect that some of the node labels will be quite sparse. (How many times do I expect to see the word "alpaca" in the dataset?) Thus we probably don't want to instantiate a totally different network for each predicate. Instead, we'll have some discrete set of networks (based on clustering? syntactic type? just one?) parameterized by a dense representation of the actual predicate. Right now I'm imagining that each predicate will be used to predict weights for a shared set of filters.
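A sketch of that last idea, assuming each predicate comes with a dense embedding (e.g. a word vector) and the shared module's filter weights are predicted from that embedding rather than learned per-predicate; all sizes and names here are placeholders.

```python
import torch
import torch.nn as nn

class PredicateModule(nn.Module):
    """One shared module whose weights are generated from a dense embedding
    of the predicate ("alpaca", "left_of", ...), so rare predicates don't
    each need their own fully trained network."""
    def __init__(self, embed_size, message_size):
        super().__init__()
        self.message_size = message_size
        # Predicts a message_size x message_size weight matrix plus a bias.
        self.weight_generator = nn.Linear(
            embed_size, message_size * message_size + message_size)

    def forward(self, message, predicate_embedding):
        generated = self.weight_generator(predicate_embedding)
        w = generated[: self.message_size ** 2].view(
            self.message_size, self.message_size)
        b = generated[self.message_size ** 2:]
        return torch.relu(message @ w.t() + b)

module = PredicateModule(embed_size=300, message_size=64)
out = module(torch.randn(1, 64), torch.randn(300))  # embedding of "alpaca"
```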
Current status:
- Arithmetic: done (& better than LSTM baseline)
- Synthetic shapes data: done
- Real images (VQA): TODO
- Jointly learning the semantic parser and the network parameters. Various levels of sophistication here---fixed query structure but uncertainty about node labels; k-best list of parses with uncertainty over query choice; a full hypergraph of candidate queries with uncertainty at each node. (A sketch of the k-best case follows.)
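For the k-best variant, one hedged sketch (reusing the hypothetical `NMN` and `modules` from above): score each candidate parse, run the assembled network for each, and marginalize the answer probability over parses, so gradients reach both the parser scores and the module parameters.

```python
import torch
import torch.nn.functional as F

def kbest_loss(parses, parse_scores, image_features, answer_index):
    """Negative log p(answer), marginalized over a k-best list of parses.

    parses: list of nested-tuple queries from the semantic parser
    parse_scores: unnormalized parser scores, shape (k,)"""
    parse_log_probs = F.log_softmax(parse_scores, dim=0)               # (k,)
    answer_log_probs = torch.stack([
        F.log_softmax(NMN(modules, q)(image_features), dim=1)[0, answer_index]
        for q in parses])                                              # (k,)
    # log p(answer) = log sum_k p(parse_k) * p(answer | parse_k)
    return -torch.logsumexp(parse_log_probs + answer_log_probs, dim=0)
```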