
Dual Self-Teaching (DST) for Dense Retrieval

This repository contains the code for the paper Typo-Robust Representation Learning for Dense Retrieval, ACL 2023.

Installation

We provide a setup.sh script that installs this repository and its Python dependencies. Run the following command:

sh setup.sh

(Optional) To evaluate the models using our evaluation script, you will also need to install trec_eval:

git clone https://github.com/usnistgov/trec_eval.git
cd trec_eval
make
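Once built, trec_eval is invoked on a qrels file and a run file. If you prefer to call it from Python rather than through the provided shell scripts, here is a minimal sketch; the binary path and file names are placeholders, not files produced by this repository:

import subprocess

# MRR@10: truncate each ranking to 10 documents (-M 10) and report reciprocal rank.
subprocess.run(
    ["./trec_eval/trec_eval", "-M", "10", "-m", "recip_rank", "qrels.txt", "run.txt"],
    check=True,
)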

Download Model Checkpoints

We provide our fine-tuned model checkpoints for BERT-based DST-DPR and CharacterBERT-based DST-DPR.
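For illustration, a downloaded BERT-based checkpoint can be loaded like any Hugging Face encoder. This is a minimal sketch assuming the checkpoint is stored in the standard Hugging Face format; the local path is a placeholder, not the actual release name:

from transformers import AutoTokenizer, AutoModel

# "checkpoints/bert_dst_dpr" is a placeholder path for the downloaded checkpoint.
tokenizer = AutoTokenizer.from_pretrained("checkpoints/bert_dst_dpr")
encoder = AutoModel.from_pretrained("checkpoints/bert_dst_dpr")

inputs = tokenizer("how to treat a sprained ankle", return_tensors="pt")
outputs = encoder(**inputs)
query_embedding = outputs.last_hidden_state[:, 0]  # [CLS] pooling, a common DPR choice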

Train Models

If you want to train the models from scratch, we provide the following training scripts.

To train the BERT-based DST-DPR model, run the following command:

sh scripts/train_bert.sh

To train the CharacterBERT-based DST-DPR model, download the pre-trained CharacterBERT from this link, and run the following command:

sh scripts/train_characterbert.sh

Evaluation

In this section, we describe the steps to evaluate the BERT-based DST-DPR model on the MS MARCO and DL-typo passage ranking datasets.

First, encode the passages and queries into dense vectors using the trained model, and then retrieve the top-k passages for each query. To do so, run the following command:

sh scripts/retrieve_bert.sh

This should generate the msmarco_bert_embs folder, containing the dense vectors of the passages and queries, and the rank_bert folder, containing the top-k passages for each query. To obtain the evaluation results, run the following command:

sh scripts/eval_bert.sh

Likewise, to evaluate the CharacterBERT-based DST-DPR model, use retrieve_characterbert.sh and eval_characterbert.sh.
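For intuition, the retrieval step above is a maximum inner-product search between the encoded query and passage vectors. A minimal sketch of that computation, using random placeholder embeddings rather than the repository's actual encoder output:

import torch

# Placeholder embeddings standing in for the encoded passages and queries.
passage_embs = torch.randn(1000, 768)
query_embs = torch.randn(5, 768)

scores = query_embs @ passage_embs.T              # inner-product similarity
topk_scores, topk_ids = scores.topk(k=10, dim=1)  # top-k passage indices per query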
