Leonid Boytsov edited this page Jul 24, 2017 · 22 revisions


This repository hosts a learning-to-rank retrieval (and re-ranking) pipeline, which is part of the PGraph project, in which we study the applicability of k-nearest neighbor (k-NN) search methods to IR and QA applications. This project is supported primarily by NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond".

Currently, this repository hosts software to reproduce our CIKM'16 paper: L. Boytsov, D. Novak, Y. Malkov, E. Nyberg (2016). Off the Beaten Path: Let’s Replace Term-Based Retrieval with k-NN Search. In Proceedings of CIKM'16. Detailed instructions are provided on this page. We do our best to provide nearly the exact set of scripts and applications needed for reproduction. However, something might have slipped through the cracks. If you encounter a problem, feel free to nudge us (e.g., via the GitHub issue tracker for this project).

Research objectives

The proposed research aims to improve the retrieval component of an extractive QA system, which traditionally relies on a filter-and-refine approach. The cheap filtering step uses full-text retrieval to obtain a list of candidates. The refining step applies a more expensive re-ranking model. Usually, the full-text index includes only lexical entries, which may lead to reduced recall due to the vocabulary mismatch.
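The filter-and-refine pipeline described above can be sketched in a few lines. This is a toy illustration, not the project's actual code: a simple term-overlap score stands in for full-text retrieval, and a hypothetical length-normalized score stands in for the expensive re-ranker.

```python
# Toy filter-and-refine retrieval sketch (all functions and data are hypothetical).

def cheap_filter(query, docs, k):
    """Cheap step: score documents by raw term overlap with the query
    (a stand-in for inverted-index full-text retrieval), keep top k."""
    q_terms = set(query.split())
    scored = [(len(q_terms & set(d.split())), i) for i, d in enumerate(docs)]
    scored.sort(reverse=True)
    return [i for score, i in scored[:k] if score > 0]

def expensive_rerank(query, docs, candidate_ids):
    """Refine step: reorder only the candidates with a costlier score,
    here term matches normalized by document length."""
    q_terms = set(query.split())
    def score(i):
        terms = docs[i].split()
        return sum(t in q_terms for t in terms) / len(terms)
    return sorted(candidate_ids, key=score, reverse=True)

docs = ["the cat sat on the mat",
        "a dog chased the cat",
        "quantum mechanics lecture notes",
        "cat"]
cands = cheap_filter("the cat", docs, k=3)
ranking = expensive_rerank("the cat", docs, cands)
```

Note that the re-ranker never touches documents pruned by the filter; this is exactly why a lexical-only filtering step can cap recall when query and document vocabularies mismatch.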

The main research hypothesis is that a high-accuracy similarity (k-NN) search powered by a proximity graph can provide a significant improvement over the classic retrieval approach based on inverted indices. In addition, the similarity search can efficiently handle more sophisticated document representations such as vectorial document embeddings, which are not supported by classic inverted indices.

Existing state-of-the-art k-NN search algorithms based on proximity graphs work well in relatively simple domains with a metric distance function (e.g. the Euclidean distance), but they are not sufficiently accurate for complex non-symmetric and non-metric distances. Our central goal is to improve these search methods and adapt them to the NLP domain.
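To make the proximity-graph idea concrete, here is a minimal sketch of greedy search on a small hand-built proximity graph with a Euclidean distance. The data, graph, and function names are hypothetical; real methods (e.g., the graph-based indices in NMSLIB) use far more elaborate graph construction and search.

```python
import math

# Hypothetical toy proximity graph: nodes are 2-D points, edges link "close" points.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.5),
          3: (3.0, 0.0), 4: (1.0, 2.0)}
graph = {0: [1, 4], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [0, 1]}

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def greedy_search(query, entry=0):
    """Starting from an entry node, repeatedly move to the neighbor closest
    to the query; stop when no neighbor improves the current distance."""
    cur = entry
    while True:
        best = min(graph[cur], key=lambda n: dist(points[n], query))
        if dist(points[best], query) < dist(points[cur], query):
            cur = best
        else:
            return cur
```

Greedy descent like this works well when the distance behaves metrically; with non-symmetric or non-metric distances the "move to the closest neighbor" heuristic can stall in poor local minima, which is the accuracy problem the project aims to address.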


Licensing

The code written by us (both this repository and NMSLIB) is distributed under the Apache 2 license. We also use Apache Lucene, Apache UIMA, Apache Thrift, DKPro Core, and ClearNLP, which carry the same license. The RankLib library uses a BSD license. Finally, we use Stanford NLP as well as a modified version of Manaal Faruqui's retrofitting script. These modules ship under a GNU license.

We may also use free software components of unknown license, in particular, GIZA++, Sparse Hash, and trec_eval.

Last but not least, a number of less important packages are employed (see the POM file); these may have their own licenses different from Apache 2.
