This repository contains the code for the paper Typo-Robust Representation Learning for Dense Retrieval, ACL 2023.
We provide a setup.sh
script that installs this repository and its Python dependencies. To use it, run the following command:
sh setup.sh
(Optional) To evaluate the models using our evaluation script, you'll need to install trec_eval:
git clone https://github.com/usnistgov/trec_eval.git
cd trec_eval
make
We provide our finetuned model checkpoints for BERT-based DST-DPR and CharacterBERT-based DST-DPR.
In case you want to train the models from scratch, we provide the training scripts as follows:
To train the BERT-based DST-DPR model, run the following command:
sh scripts/train_bert.sh
To train the CharacterBERT-based DST-DPR model, download the pre-trained CharacterBERT from this link, then run the following command:
sh scripts/train_characterbert.sh
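The core idea behind typo-robust training is to expose the model to misspelled variants of queries. We don't reproduce the exact augmentation used in the training scripts here, but the sketch below illustrates the general recipe: corrupt each word of a query with a random character-level edit (insert, delete, substitute, or swap). All function names are illustrative, not part of the repository.

```python
import random

def add_typo(word, rng):
    """Apply one random character-level edit to a word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    if len(word) < 2:
        return word + rng.choice(letters)
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["insert", "delete", "substitute", "swap"])
    if op == "insert":
        return word[:i] + rng.choice(letters) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(letters) + word[i + 1:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # swap adjacent chars

def typo_query(query, rng, p=0.5):
    """Corrupt each word of a query with probability p."""
    return " ".join(add_typo(w, rng) if rng.random() < p else w
                    for w in query.split())

rng = random.Random(0)
print(typo_query("what is dense retrieval", rng))
```

During training, such corrupted queries are paired with their clean counterparts so the model learns to map both to similar representations.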
In this section, we describe the steps to evaluate the BERT-based DST-DPR model on the MS MARCO and DL-typo passage ranking datasets.
First, we need to encode the passages and queries into dense vectors using the trained models, then retrieve the top-k passages for each query. To do so, run the following command:
sh scripts/retrieve_bert.sh
This should generate the msmarco_bert_embs
folder, containing the dense vectors of passages and queries, and the rank_bert
folder, containing the top-k passages for each query.
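Conceptually, the retrieval step scores each query-passage pair by the inner product of their dense vectors and keeps the k highest-scoring passages per query. A minimal NumPy sketch with toy random embeddings (the actual script may use an ANN index rather than this brute-force search):

```python
import numpy as np

def retrieve_topk(query_embs, passage_embs, k=10):
    """Brute-force dense retrieval: rank passages by inner-product score."""
    scores = query_embs @ passage_embs.T           # (num_queries, num_passages)
    topk = np.argsort(-scores, axis=1)[:, :k]      # top-k passage indices per query
    return topk, np.take_along_axis(scores, topk, axis=1)

# Toy example: 100 passages and 5 queries with random 768-d embeddings.
rng = np.random.default_rng(0)
passages = rng.normal(size=(100, 768))
queries = rng.normal(size=(5, 768))
ids, scores = retrieve_topk(queries, passages, k=10)
```

The embedding folder produced above plays the role of `passages` and `queries` here; the rank folder stores the resulting per-query rankings.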
To obtain the evaluation results, use the following command:
sh scripts/eval_bert.sh
Likewise, to evaluate the CharacterBERT-based DST-DPR model, use retrieve_characterbert.sh and eval_characterbert.sh.