This repository hosts a learning-to-rank retrieval (and re-ranking) pipeline, which is part of the PGraph project, where we study the applicability of k-nearest neighbor (k-NN) search methods to IR and QA applications. This project is supported primarily by the NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond".
Currently, this repository hosts software to reproduce our CIKM'16 paper: L. Boytsov, D. Novak, Y. Malkov, E. Nyberg (2016). Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search. In Proceedings of CIKM'16. Detailed instructions are provided on this page. We do our best to provide nearly the exact set of scripts and apps needed for reproduction. However, something might have slipped through the cracks. If you encounter a problem, feel free to nudge us (e.g., via the GitHub issue tracker for this project).
The proposed research aims to improve the retrieval component of an extractive QA system, which traditionally relies on a filter-and-refine approach. The cheap filtering step uses full-text retrieval to obtain a list of candidates. The refining step applies a more expensive re-ranking model. Usually, the full-text index includes only lexical entries, which may lead to reduced recall due to the vocabulary mismatch.
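To make the filter-and-refine scheme concrete, here is a minimal, self-contained Python sketch (not the project's actual code): a cheap BM25-style lexical scorer produces a candidate list, and an arbitrary, more expensive re-ranking function then reorders the candidates. All names here are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Cheap filtering step: score each doc by BM25 over exact term matches."""
    N = len(docs)
    avgdl = sum(len(d.split()) for d in docs) / N
    df = Counter()
    for d in docs:
        df.update(set(d.split()))          # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d.split())
        dl = len(d.split())
        s = 0.0
        for t in query.split():
            if t not in tf:
                continue                   # lexical mismatch: term contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

def filter_and_refine(query, docs, rerank, n_candidates=10, k=3):
    """Retrieve candidates cheaply, then apply an expensive re-ranker to them only."""
    scores = bm25_scores(query, docs)
    cand = sorted(range(len(docs)), key=lambda i: -scores[i])[:n_candidates]
    return sorted(cand, key=lambda i: -rerank(query, docs[i]))[:k]
```

For example, with a toy word-overlap re-ranker (a stand-in for a learned model), `filter_and_refine("cat dog", docs, rerank, k=1)` returns the index of the document covering both query terms. Note the failure mode the text describes: a relevant document phrased with synonyms only ("kitten", "puppy") would get a zero BM25 score and never reach the re-ranker.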
The main research hypothesis is that a high-accuracy similarity (k-NN) search powered by a proximity graph can provide a significant improvement over the classic retrieval approach based on inverted indices. In addition, the similarity search can efficiently handle more sophisticated document representations such as vectorial document embeddings, which are not supported by classic inverted indices.
Existing state-of-the-art k-NN search algorithms based on proximity graphs work well in relatively simple domains with a metric distance function (e.g. the Euclidean distance), but they are not sufficiently accurate for complex non-symmetric and non-metric distances. Our central goal is to improve these search methods and adapt them to the NLP domain.
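The core primitive behind proximity-graph search methods is a greedy best-first traversal: starting from an entry node, repeatedly expand the closest unvisited candidate's neighbors, keeping a bounded pool of best results. The sketch below is a simplified illustration of this idea, not NMSLIB's implementation; the graph, distance function, and `ef` pool size are assumptions.

```python
import heapq

def greedy_knn(graph, dist, query, entry, k, ef=10):
    """Greedy best-first search over a proximity graph.
    graph: dict mapping node -> list of neighbor nodes
    dist:  function dist(query, node) -> non-negative float
    ef:    size of the dynamic result pool (larger -> more accurate, slower)
    """
    d0 = dist(query, entry)
    visited = {entry}
    candidates = [(d0, entry)]        # min-heap of nodes to expand
    results = [(-d0, entry)]          # max-heap (negated) of best nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        # stop when the closest unexpanded candidate is worse than the
        # worst retained result and the pool is full
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # evict current worst
    return sorted((-d, n) for d, n in results)[:k]
```

Nothing in this traversal requires the distance to be metric or symmetric, which is why it can in principle be applied to the non-metric similarities studied here; the difficulty the text refers to is that graph construction and the greedy stopping rule become much less reliable in such domains.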
The code written by us (both this repository and NMSLIB) is distributed under the Apache 2 license. We also use Apache Lucene, Apache UIMA, Apache Thrift, DKPro Core, and ClearNLP, which have the same license. The RankLib library uses a BSD license. Finally, we use Stanford NLP as well as a modified version of Manaal Faruqui's retrofitting script. These modules are distributed under GNU licenses.
Last but not least, we employ a number of less central packages (see the POM file), which may have their own licenses that differ from Apache 2.