Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval
Make sure you have a Python >= 3.7 environment with PyTorch installed.
Then run the following command to set up the environment.
pip install -e .
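If you want to sanity-check the setup first, a quick version check (an optional illustration, not part of the provided scripts) can confirm that PyTorch is installed and sees your GPUs:
# optional sanity check: PyTorch version and GPU visibility
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"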
There are two ways to train the model: one uses query generation as data augmentation and the other does not.
Note: Our models are trained on 8 V100 GPUs with 32GB memory. If you use a different configuration, please change the parameters in the training scripts accordingly.
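For example, with fewer GPUs one common adjustment is to increase gradient accumulation so the effective batch size stays the same; the exact variable names depend on the training scripts, and the per-GPU batch size of 8 below is purely hypothetical:
# effective batch size = per-GPU batch size x number of GPUs x gradient accumulation steps
# 8 GPUs: 8 x 8 x 1 = 64
# 4 GPUs: 8 x 4 x 2 = 64  (double gradient accumulation to compensate)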
- w/o data augmentation
MODEL_DIR=/path/to/save/model
bash scripts/train.sh $MODEL_DIR
- w/ data augmentation
PRETRAINED_MODEL_DIR=/path/to/save/pretrained/model
MODEL_DIR=/path/to/save/model
bash scripts/pretrain_corpus.sh $PRETRAINED_MODEL_DIR
bash scripts/finetune.sh $MODEL_DIR $PRETRAINED_MODEL_DIR
The following commands encode the corpus into vectors. The corpus is partitioned into 20 shards due to resource limits.
ENCODE_DIR=/path/to/save/encoding
# encode corpus
for i in $(seq 0 19)
do
bash scripts/encode_corpus_with_query_shard.sh $ENCODE_DIR $i $MODEL_DIR
done
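The shards are independent, so they can also be encoded in parallel. One possible variant (illustrative only, assuming each shard job fits on a single GPU and the encoding script respects CUDA_VISIBLE_DEVICES) is:
# encode the first 8 shards in parallel, one shard per GPU (illustrative only)
for i in $(seq 0 7)
do
  CUDA_VISIBLE_DEVICES=$i bash scripts/encode_corpus_with_query_shard.sh $ENCODE_DIR $i $MODEL_DIR &
done
wait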
We evaluate the retrieval performance on the following two benchmarks.
- MS MARCO
RESULT_DIR=/path/to/save/result
# encode query
bash scripts/encode_dev_query.sh $ENCODE_DIR $MODEL_DIR
# shard search
for i in $(seq 0 19)
do
bash scripts/search_shard.sh $ENCODE_DIR $i
done
# reduce
bash scripts/reduce.sh $ENCODE_DIR $RESULT_DIR
# evaluation
bash scripts/evaluate.sh $RESULT_DIR
- TREC DL
YEAR=2019 # or 2020
RESULT_DIR=/path/to/save/result
# encode query
bash scripts/encode_trec_query.sh $ENCODE_DIR $MODEL_DIR $YEAR
# shard search
for i in $(seq 0 19)
do
bash scripts/search_trec_shard.sh $ENCODE_DIR $i $YEAR
done
# reduce
bash scripts/reduce_trec.sh $ENCODE_DIR $RESULT_DIR $YEAR
# evaluation
bash scripts/evaluate_trec.sh $RESULT_DIR $YEAR
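To get results for both TREC DL years, the same steps can simply be wrapped in a loop; this reuses the commands above unchanged (if the scripts do not distinguish outputs by year, consider using a separate RESULT_DIR per year):
for YEAR in 2019 2020
do
  bash scripts/encode_trec_query.sh $ENCODE_DIR $MODEL_DIR $YEAR
  for i in $(seq 0 19)
  do
    bash scripts/search_trec_shard.sh $ENCODE_DIR $i $YEAR
  done
  bash scripts/reduce_trec.sh $ENCODE_DIR $RESULT_DIR $YEAR
  bash scripts/evaluate_trec.sh $RESULT_DIR $YEAR
done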
The code is mainly based on the Tevatron toolkit. We also used some code and data from docTTTTTquery, beir and transformers. Thanks for the great work!