plandes/deep-qa-rank

Deep learning implementation of question/answer for information retrieval.

This is a deep learning implementation of question/answer ranking for information retrieval based on the Severyn et al. paper "Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks".

Important: for those interested in building on this source, please see the derived works section.

Problem Definition

This implementation ranks a paragraph for a given user question. Many question/answer (QA) systems use a multi-level classifier to narrow down to an answer, identified as a text span in the original document. The model given in this implementation provides that first coarse-level classification. A subsequent component of a pipeline would then classify the text span providing the answer to the user; an example of such a later pipeline component is the BERT model.

Example Input and Output

Example from the trained model:

Question:

How many ministries of the Scottish government does a committee typically correspond to?

Paragraph classified with rank 0 and probability 97.7% (correct paragraph):

Subject Committees are established at the beginning of each parliamentary session, and again the members on each committee reflect the balance of parties across Parliament. Typically each committee corresponds with one (or more) of the departments (or ministries) of the Scottish Government. The current Subject Committees in the fourth Session are: Economy, Energy and Tourism; Education and Culture; Health and Sport; Justice; Local Government and Regeneration; Rural Affairs, Climate Change and Environment; Welfare Reform; and Infrastructure and Capital Investment.

Paragraph classified with rank 1 and probability 96.1% (incorrect paragraph):

The debating chamber of the Scottish Parliament has seating arranged in a hemicycle, which reflects the desire to encourage consensus amongst elected members. There are 131 seats in the debating chamber. Of the total 131 seats, 129 are occupied by the Parliament's elected MSPs and 2 are seats for the Scottish Law Officers – the Lord Advocate and the Solicitor General for Scotland, who are not elected members of the Parliament but are members of the Scottish Government. As such the Law Officers may attend and speak in the plenary meetings of the Parliament but, as they are not elected MSPs, cannot vote. Members are able to sit anywhere in the debating chamber, but typically sit in their party groupings. The First Minister, Scottish cabinet ministers and Law officers sit in the front row, in the middle section of the chamber. The largest party in the Parliament sits in the middle of the semicircle, with opposing parties on either side. The Presiding Officer, parliamentary clerks and officials sit opposite members at the front of the debating chamber.

For this example, the correct paragraph is ranked first. Had the incorrect paragraph been ranked first instead, the correct paragraph would have a rank of 1. Read here for more information on ranking.
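To make the ranking concrete, here is a minimal sketch of how a rank follows from the model's output: candidate paragraphs are sorted by the model's probability that they answer the question, and the rank is the zero-based position in that ordering. The identifiers and scores below are illustrative, not actual model output.

```python
def rank_paragraphs(scores: dict) -> list:
    """Return (rank, paragraph_id, probability) tuples, best match first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, pid, prob) for rank, (pid, prob) in enumerate(ordered)]

# hypothetical probabilities for the two paragraphs above
scores = {'subject-committees': 0.977, 'debating-chamber': 0.961}
for rank, pid, prob in rank_paragraphs(scores):
    print(f'rank {rank}: {pid} ({prob:.1%})')
```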

Corpus

This uses version 1.1 of the SQuAD corpus. Note that the paper uses a different corpus, which explains the different (but not dissimilar) results. The corpus was created from Wikipedia articles and annotated, per paragraph, with questions and text spans containing the answer.

Given this project attempts only to classify a paragraph for a query, the answer annotations are not used. While not explicitly explained in the paper, this implementation assumes the problem is framed as binary classification. For this reason, this model also uses binary classification of whether a question belongs to a paragraph. More specifically, the model predicts yes when it thinks a particular paragraph P answers a question Q, for each question/answer pair in the data set.

The data set is generated with a roughly even number of positive and negative examples. The positive examples are created by pairing all questions with their respective paragraph from the corpus. Negative examples are created by pairing paragraphs with questions from other paragraphs at random without replacement; more specifically, a paragraph P_i is paired with a question Q_j whose answer it does not contain, for all i and j where i != j. These examples create triples of the form:

(paragraph, question, <True | False>)
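The following sketch shows one way such triples could be generated under the scheme just described; the corpus structure and helper names are assumptions for illustration, not the project's actual code.

```python
import random

def make_triples(corpus: dict, seed: int = 0) -> list:
    """Generate (paragraph, question, label) triples from a mapping of
    each paragraph to its list of questions."""
    rng = random.Random(seed)
    # positives: every question paired with its own paragraph
    triples = [(para, q, True) for para, qs in corpus.items() for q in qs]
    # negatives: draw questions at random, without replacement, from
    # paragraphs other than the one being paired
    pool = [(p, q) for p, qs in corpus.items() for q in qs]
    rng.shuffle(pool)
    for para in corpus:
        for i, (src, question) in enumerate(pool):
            if src != para:              # question answered elsewhere
                triples.append((para, question, False))
                del pool[i]              # without replacement
                break
    return triples
```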

Corpus Statistics

The SQuAD v1.1 corpus has the following paragraphs and questions:

| Data Set   | Articles | Paragraphs | Questions |
|------------|----------|------------|-----------|
| Train      | 442      | 18,896     | 87,599    |
| Test (dev) | 42       | 2,067      | 10,570    |
| Totals     | 484      | 20,963     | 98,169    |

The synthesized corpus used for this implementation has the following composition:

| Data Set   | Positive | Negative | Total   |
|------------|----------|----------|---------|
| Train      | 87,599   | 75,584   | 163,183 |
| Test (dev) | 10,570   | 10,335   | 20,905  |
| Totals     | 98,169   | 85,919   | 184,088 |

The positive and negative examples aren't evenly stratified because the number of negative questions is estimated a priori in terms of paragraphs. Future work is to make these numbers more even using an iterative method.

Evaluation

The data set described in the corpus section was used to train the network, with 20% of the training set held out for validation. The dev data set was used as a hold-out test set on the trained model, again with near even positive and negative strata.

Model

The model implements something very similar to the paper with a few exceptions. It uses the deep learning neural network architecture pictured below.

[Architecture diagram]

Features

The set of features was taken from word2vec word vectors and the output of SpaCy. The word vectors form the input layer, with zero vectors used for out-of-vocabulary words. Each token of the paragraph and question contributes the dimension (300) of the word vector. The width of the input layer is the sum of the maximum token counts over all paragraphs and questions; zero padding is used for utterances with token counts less than the max.

The maximum token count is 566 for paragraphs and 34 for questions. These counts were taken across both the training and development data sets.
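A minimal sketch of this encoding, assuming a gensim-style KeyedVectors lookup for the word2vec vectors (the project's actual feature code may differ):

```python
import numpy as np

DIM, PARA_LEN, QUES_LEN = 300, 566, 34   # dimensions from the text above

def encode(tokens: list, max_len: int, vectors) -> np.ndarray:
    """Stack 300-dimensional word vectors for the tokens, leaving zero
    vectors for out-of-vocabulary words and zero padding up to max_len."""
    mat = np.zeros((max_len, DIM), dtype=np.float32)
    for i, tok in enumerate(tokens[:max_len]):
        if tok in vectors:               # OOV tokens stay as zero vectors
            mat[i] = vectors[tok]
    return mat

def encode_pair(para_tokens: list, ques_tokens: list, vectors) -> np.ndarray:
    # the input width is the sum of the two maximum token counts (566 + 34)
    return np.concatenate([encode(para_tokens, PARA_LEN, vectors),
                           encode(ques_tokens, QUES_LEN, vectors)])
```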

Configuration

All parsing, training and testing parameters and configuration are stored in the configuration file. The documentation for each parameter is given as comments in that file.

Changes

This implementation follows the paper pretty closely. However, it differs in the following ways:

  • This work used the SQuAD corpus instead of the TREC QA data set from Wang et al.
  • It uses 70 instead of 100 convolution filters for each of the question and paragraph. Adding filters beyond 20 did not change the performance with any statistical significance. This is configurable in the nn_model section of the configuration file (see the example after this list).
  • Additional SpaCy features in the input layer were used, which included:
    • POS tag
    • Named entity from the SpaCy named entity recognizer (NER). This is configurable in the corpus_token_normalizer in the configuration file.
  • The number of shared named entities is used as a count in the "common" shared layer.
  • The Google pre-trained word2vec word vectors were used, whereas Severyn et al. trained their own.
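For instance, assuming an INI-style configuration file (the file name and option name below are hypothetical; the real parameters are documented as comments in the configuration file itself), the filter count could be read like so:

```python
from configparser import ConfigParser

config = ConfigParser()
config.read('model.conf')                # hypothetical file name
# read the filter count from the nn_model section, defaulting to 70
n_filters = config.getint('nn_model', 'n_filters', fallback=70)
```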

Latent Semantic Analysis Baseline

A baseline was also computed using Latent Semantic Analysis (LSA) and K-Nearest Neighbors as an unsupervised learning task. LSA is an older and fundamentally different method, but it helps put the results in perspective.

The training set paragraphs were used to train the model, which for LSA means applying Singular Value Decomposition (SVD) and then reducing dimensionality for the low rank approximation. The dimensionality used was 200 (see the configuration file).

Across the 18,896 paragraphs used to train the model, 87,599 questions were ranked. Again, given the nature of unsupervised training, only the paragraphs from the training set were used for training, to keep the amount of training data on par with the neural model.
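A sketch of this baseline, assuming tf-idf term weighting and substituting a cosine-similarity nearest-neighbor ranking for the K-Nearest Neighbor lookup (the project's exact preprocessing may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def fit_lsa(paragraphs: list, n_components: int = 200):
    """Fit LSA on the training paragraphs: tf-idf followed by a truncated
    SVD for the 200-dimension low rank approximation."""
    vectorizer = TfidfVectorizer()
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    reduced = svd.fit_transform(vectorizer.fit_transform(paragraphs))
    return vectorizer, svd, reduced

def rank(question: str, vectorizer, svd, reduced):
    """Return paragraph indices ordered best match (rank 0) first."""
    query = svd.transform(vectorizer.transform([question]))
    sims = cosine_similarity(query, reduced)[0]
    return sims.argsort()[::-1]
```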

Installing

You need to install Python 3.6 first.

GNU make is not a must, but everything is automated with make, so adding it will make your life easier. If you decide not to use it, follow the flow of the makefile, as duplicate instructions aren't given.

Usage

The project was written to make steps repeatable for consistent results and to easily test hyperparameter tuning and natural language parsing configurations.

Quick Start

For the brave, you can do all steps with make all. Given the fickle nature of Python environments, chances are something will fail, but you might get lucky.

Iterative Method for Results

Otherwise, do the following and plan to make coffee:

  1. Install Python virtualenv: pip3 install virtualenv.
  2. Start with a clean slate (if rebuilding): make vaporize.
  3. Create a Python 3.6 virtual environment: make virtualenv. Note a virtual environment isn't necessary, but it is recommended. Otherwise comment out the PYTHON_BIN line in the makefile to use the system default.
  4. Download and parse the corpus: make parse. This downloads the SQuAD corpus the first time (if not found) to data/corpora/squad. Once the corpus is downloaded it is parsed with SpaCy and stored as binary Python pickled files in data/proc.
  5. Extract and store features from the SpaCy parse: make features. This creates features from the parsed data in a fast consumable format for the training phase.
  6. Train the model: make train. This trains the neural network, stopping early when the validation loss increases (see the sketch after this list).
  7. Test the model: make test. This tests the network by computing an accuracy on rank 0 across the test set, which is the SQuAD dev data set.
  8. Compute the rankings: make calcrank. This stores the rank of each question across all paragraphs of the data set, which takes a very long time. Each ranking, with all predictions and question/paragraph IDs, is stored on disk.
  9. Compute the mean reciprocal rank of the SQuAD dev data set: make calcmrr. This uses the ranking results from the previous step to compute the MRR score.
  10. Compute the same metrics for the baseline: make lsammr.
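The early stopping in step 6 works along these lines; this is a framework-agnostic sketch with hypothetical train_epoch, validate, and snapshot callables, not the project's actual training loop:

```python
def train(model, train_epoch, validate, snapshot, max_epochs=100):
    """Train until validation loss increases, returning the best state."""
    best_loss, best_state = float('inf'), None
    for epoch in range(max_epochs):
        train_epoch(model)               # one pass over the training data
        loss = validate(model)           # loss on the 20% validation split
        if loss < best_loss:
            best_loss, best_state = loss, snapshot(model)
        else:
            break                        # validation loss increased: stop
    return best_state
```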

Note that each step writes its respective log file to the log directory.

Results

All results were tested on the same set of features created by SpaCy as mentioned in the features section.

Baseline

As described in the LSA Baseline section, the training and evaluation were over ~87.5K questions trained from 18.9K paragraphs. The mean reciprocal rank was 0.1902, meaning the correct paragraph was ranked as the 5.2567th most probable on average. The rank standard deviation was 0.3258.
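For reference, MRR is the mean of 1/(rank + 1) over all questions (ranks here are zero-based, as elsewhere in this README), and 1/MRR yields the "average position" reading used above. The ranks below are illustrative only:

```python
def mean_reciprocal_rank(ranks: list) -> float:
    """Compute MRR from zero-based ranks of the correct paragraph."""
    return sum(1.0 / (r + 1) for r in ranks) / len(ranks)

ranks = [0, 0, 1, 4, 9]                  # illustrative zero-based ranks
mrr = mean_reciprocal_rank(ranks)
print(f'MRR {mrr:.4f} -> ranked {1 / mrr:.2f} on average')  # 0.5600 -> 1.79
```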

Neural Net Method

As described in the evaluation section, the trained model was used to rank 10,570 questions across 2,067 paragraphs. A mean reciprocal rank of 0.7088 was achieved with a standard deviation of 0.392. Again, taking the reciprocal, one can think of the correct paragraph as ranked the 1.411th most probable on average. The model ranks the correct paragraph at rank 0 for the given test data set question/answer pairing with an overall accuracy of 96%.

The paper, training and testing on the TREC QA data set, achieves a better score of 0.8078. The data set and other differences might explain why these results differ. My guess is that it has a lot to do with the corpus over which the word vectors were trained. Further error analysis and testing on the TREC data set might provide an answer to this question. However, this work was an exercise and not an academic paper.

Perhaps someone else can run this project on the TREC corpus. If you do, submit a pull request and I'll fold in your code and give you credit for your work.

TREC QA Data Set

The paper cites the corpus used in its experiments from the Wang et al. paper "What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA".

The data set can be downloaded here.

Derived Works

If you build on this work, by this license you must give attribution to me and the original authors of the paper. Doing otherwise goes against the MIT Open Source License, academic norms, and is morally wrong.

Changelog

An extensive changelog is available here.

License

Copyright (c) 2019 Paul Landes

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
