
haonan3/Neural-Corpus-Indexer-NCI

 
 


[👑 NeurIPS 2022 Outstanding Paper] A Neural Corpus Indexer for Document Retrieval -- NCI (Paper)


What is NCI?

NCI is an end-to-end, sequence-to-sequence, differentiable document retrieval model that retrieves relevant document identifiers directly for a given query. In our evaluation on the Google NQ and TriviaQA datasets, NCI outperforms all baselines and model-based indexers:

Model                Recall@1  Recall@10  Recall@100  MRR@100
NCI (Ensemble)       70.46     89.35      94.75       77.82
NCI (Large)          66.23     85.27      92.49       73.37
NCI (Base)           65.86     85.20      92.42       73.12
DSI (T5-Base)        27.40     56.60      --          --
DSI (T5-Large)       35.60     62.60      --          --
SEAL (Large)         59.93     81.24      90.93       67.70
ANCE (MaxP)          52.63     80.38      91.31       62.84
BM25 + DocT5Query    35.43     61.83      76.92       44.47
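
For readers unfamiliar with the column headers, the following is a minimal sketch of how Recall@k and MRR@100 are commonly computed from ranked retrieval output. It assumes exactly one relevant document per query and reports percentages, as the table does. The function names and the `runs` structure are illustrative and are not taken from the NCI code.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id, cutoff=100):
    """1/rank of the relevant document within the cutoff, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k_values=(1, 10, 100), mrr_cutoff=100):
    """Average the per-query metrics over `runs` and scale to percentages.

    `runs` is a list of (ranked_ids, relevant_id) pairs, one pair per query.
    """
    n = len(runs)
    metrics = {}
    for k in k_values:
        hits = sum(recall_at_k(ranked, rel, k) for ranked, rel in runs)
        metrics[f"Recall@{k}"] = 100.0 * hits / n
    rr_sum = sum(reciprocal_rank(ranked, rel, mrr_cutoff) for ranked, rel in runs)
    metrics[f"MRR@{mrr_cutoff}"] = 100.0 * rr_sum / n
    return metrics
```

With this definition, a query whose relevant document is ranked second contributes 0 to Recall@1, 1 to Recall@10, and 0.5 to the MRR sum.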

For more information, check out our paper: https://arxiv.org/abs/2206.02743

Environment

[1] Install Anaconda.

[2] Clone repository:

git clone https://github.com/solidsea98/Neural-Corpus-Indexer-NCI.git
cd Neural-Corpus-Indexer-NCI

[3] Create conda environment:

conda env create -f environment.yml
conda activate NCI

[4] Docker:

If needed, a Docker image for NCI is available as mzmssg/corpus_env:latest.

Data Process

You can process data with NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

[1] Dataset Download.

NCI is currently evaluated on the Google NQ and TriviaQA datasets. Please download them before re-training.

[2] Semantic Identifier

NCI uses content-based document identifiers. A pre-trained BERT model generates an embedding for each document, the documents are clustered with hierarchical k-means, and each document is assigned a semantic identifier based on its position in the cluster tree. To ensemble NCI models, you can generate several sets of embeddings and semantic identifiers and train one model on each.
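
The clustering step can be sketched as follows. This is a minimal illustration, not the code from the notebooks: it uses a small NumPy k-means instead of a library implementation, and the parameter names and defaults (`k`, `max_leaf_size`) are assumptions chosen for the example. Each document's identifier is the sequence of cluster indices on its path down the tree, followed by its position inside the final leaf cluster.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster label for every row of `points`."""
    points = np.asarray(points, dtype=float)
    k = min(k, len(points))
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def assign_semantic_ids(embeddings, k=10, max_leaf_size=100, seed=0):
    """Return one identifier (a list of ints) per row of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    ids = [[] for _ in range(len(embeddings))]

    def recurse(indices):
        labels = None
        if len(indices) > max_leaf_size:
            labels = kmeans(embeddings[indices], k, seed=seed)
        if labels is None or len(np.unique(labels)) < 2:
            # Leaf (small enough, or the split was degenerate):
            # number the documents inside this cluster.
            for position, doc in enumerate(indices):
                ids[doc].append(position)
            return
        for cluster in np.unique(labels):
            child = indices[labels == cluster]
            # Every document in this child shares the same next identifier digit.
            for doc in child:
                ids[doc].append(int(cluster))
            recurse(child)

    recurse(np.arange(len(embeddings)))
    return ids
```

Documents that fall into the same clusters share an identifier prefix, which is what makes the identifiers content-based. In practice `embeddings` would hold the BERT document vectors; any 2-D float array works for trying the sketch out.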

[3] Query Generation

In our study, query generation significantly improves retrieval performance, especially for long-tail queries.

NCI uses a docTTTTTquery checkpoint to generate synthetic queries. Please refer to the docTTTTTquery documentation.
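
As a rough sketch of what generating synthetic queries with a docTTTTTquery-style T5 model looks like, the example below uses the Hugging Face `transformers` library. The checkpoint name `castorini/doc2query-t5-base-msmarco` is an assumption for illustration, since this README does not say which checkpoint NCI uses. The sampling settings (`top_k`, `max_length`, number of queries) are likewise illustrative. The helper names are hypothetical.

```python
def load_doc2query(checkpoint="castorini/doc2query-t5-base-msmarco"):
    """Load a tokenizer and model; the default checkpoint name is an assumption."""
    # Imported here so generate_queries() can be used with any compatible objects.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()
    return tokenizer, model


def generate_queries(doc_text, tokenizer, model, num_queries=3, max_input_tokens=512):
    """Sample `num_queries` synthetic queries for one document."""
    inputs = tokenizer(
        doc_text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    outputs = model.generate(
        **inputs,
        max_length=64,       # upper bound on generated query length
        do_sample=True,      # sampling yields more diverse queries than greedy decoding
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    # Drop exact duplicates while keeping the original order.
    return list(dict.fromkeys(queries))
```

A typical call would be `tokenizer, model = load_doc2query()` followed by `generate_queries(document_text, tokenizer, model)`. The generated queries can then be paired with the document's semantic identifier as extra training examples.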

Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

Training

Once data pre-processing is complete, you can launch training with train.sh. You can also train directly on our processed NQ data (download it to './Data_process/NQ_dataset/') or TriviaQA data (download it to './Data_process/trivia_dataset/').

Evaluation

Please use infer.sh with our NQ checkpoint or TriviaQA checkpoint (download it to './NCI_model/logs/'). You can also run inference with your own checkpoint to evaluate model performance.

To ensemble on the NQ or TriviaQA dataset, use our results (download them to './NCI_model/logs/') or your own results.
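
This README does not describe how the per-model results are combined. As a generic illustration only (not necessarily what the repository's ensemble step does), the sketch below merges several ranked lists with reciprocal rank fusion, a common rank-based ensembling technique. The function name and the constant `c=60` are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60, top_n=100):
    """Merge ranked lists of document ids into one list.

    Each document scores sum(1 / (c + rank)) over the lists it appears in,
    so documents ranked highly by several models rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by descending score; break ties by document id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]
```

Here `rankings` would hold one ranked list per trained model for the same query, for example the outputs of models trained with different semantic identifiers.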

Citation

If you find this work useful for your research, please cite:

@article{wang2022neural,
  title={A Neural Corpus Indexer for Document Retrieval},
  author={Wang, Yujing and Hou, Yingyan and Wang, Haonan and Miao, Ziming and Wu, Shibin and Sun, Hao and Chen, Qi and Xia, Yuqing and Chi, Chengmin and Zhao, Guoshuai and others},
  journal={arXiv preprint arXiv:2206.02743},
  year={2022}
}

Acknowledgement

We learned a lot and borrowed some code from the following projects when building NCI.
