PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval, Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin and Guido Zuccon.
We recommend using a conda environment to install the required dependencies.
conda create -n promptreps python=3.10
conda activate promptreps
# clone this repo
git clone https://github.com/ielab/PromptReps.git
cd PromptReps
Our code is built on top of the Tevatron library. To install the required dependencies, run the following commands:
Note: our code is tested against the Tevatron main branch at commit d1816cf.
git clone https://github.com/texttron/tevatron.git
cd tevatron
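# optionally pin to the tested commit noted above
git checkout d1816cf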
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu # or 'conda install pytorch::faiss-gpu' for faiss gpu search
pip install ranx
pip install nltk
pip install -e .
cd ..
We also use Pyserini to build the inverted index for the sparse representations and to evaluate the results. To install it, run the following commands:
conda install -c conda-forge openjdk=21 maven -y
pip install pyserini
If you have any issues with the Pyserini installation, please follow the Pyserini installation guide.
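To quickly verify that Pyserini and its JVM bindings work, you can try loading one of its Lucene classes (a minimal sanity check; an import error here usually points to a Java setup problem):
python -c "from pyserini.search.lucene import LuceneSearcher; print('pyserini OK')"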
In this example, we show an experiment with the nfcorpus dataset from BEIR using the microsoft/Phi-3-mini-4k-instruct model.
BASE_MODEL=microsoft/Phi-3-mini-4k-instruct
DATASET=nfcorpus
OUTPUT_DIR=outputs/${BASE_MODEL}/
You can run experiments with other LLMs from the Hugging Face model hub by changing the BASE_MODEL variable, but you may also need to add prompt files to the prompts/${BASE_MODEL} directory. Similarly, you can change the DATASET variable to any of the other BEIR dataset names listed here. All results and intermediate files are stored in the OUTPUT_DIR directory.
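For instance, to try a different model and dataset (illustrative values only; the prompt files must already exist under prompts/${BASE_MODEL}):
BASE_MODEL=meta-llama/Meta-Llama-3-8B-Instruct
DATASET=scifact
With the variables set, first encode the queries: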
python encode.py \
--output_dir=temp \
--model_name_or_path ${BASE_MODEL} \
--tokenizer_name ${BASE_MODEL} \
--per_device_eval_batch_size 64 \
--query_max_len 512 \
--normalize \
--dataset_name Tevatron/beir \
--dataset_config ${DATASET} \
--dataset_split test \
--dense_output_dir ${OUTPUT_DIR}/beir/${DATASET}/dense \
--sparse_output_dir ${OUTPUT_DIR}/beir/${DATASET}/sparse \
--encode_is_query \
--bf16 \
--query_prefix prompts/${BASE_MODEL}/query_prefix.txt \
--query_suffix prompts/${BASE_MODEL}/query_suffix.txt \
--cache_dir cache_models \
--dataset_cache_dir cache_datasets \
--sparse_exact_match
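Assuming encode.py writes the query representations to the paths consumed by the search steps below (query.pkl for dense, query.tsv for sparse), you can quickly check that the outputs exist:
# expected outputs of the query encoding step
ls ${OUTPUT_DIR}/beir/${DATASET}/dense/query.pkl ${OUTPUT_DIR}/beir/${DATASET}/sparse/query.tsv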
For a large corpus, we shard the document collection and encode each shard in parallel across multiple GPUs. For example, if you have two GPUs:
NUM_AVAILABLE_GPUS=2
for i in $(seq 0 $((NUM_AVAILABLE_GPUS-1)))
do
CUDA_VISIBLE_DEVICES=${i} python encode.py \
--output_dir=temp \
--model_name_or_path ${BASE_MODEL} \
--tokenizer_name ${BASE_MODEL} \
--per_device_eval_batch_size 64 \
--passage_max_len 512 \
--normalize \
--bf16 \
--dataset_name Tevatron/beir-corpus \
--dataset_config ${DATASET} \
--dense_output_dir ${OUTPUT_DIR}/beir/${DATASET}/dense \
--sparse_output_dir ${OUTPUT_DIR}/beir/${DATASET}/sparse \
--passage_prefix prompts/${BASE_MODEL}/passage_prefix.txt \
--passage_suffix prompts/${BASE_MODEL}/passage_suffix.txt \
--cache_dir cache_models \
--dataset_cache_dir cache_datasets \
--dataset_number_of_shards ${NUM_AVAILABLE_GPUS} \
--dataset_shard_index ${i} \
--sparse_exact_match &
done
wait
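Each shard writes its own dense pickle into the dense/ output directory; the search step below picks them all up via the corpus_*.pkl glob. To check:
# one dense pickle per shard
ls ${OUTPUT_DIR}/beir/${DATASET}/dense/corpus_*.pkl
Once all shards have finished, run dense retrieval: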
mkdir -p ${OUTPUT_DIR}/beir/${DATASET}/results
python search.py \
--query_reps ${OUTPUT_DIR}/beir/${DATASET}/dense/query.pkl \
--passage_reps "${OUTPUT_DIR}/beir/${DATASET}/dense/corpus_*.pkl" \
--depth 1000 \
--batch_size 64 \
--save_text \
--save_ranking_to ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.txt
# add '--use_gpu' if you want to use gpu for search
# convert to trec run format
python -m tevatron.utils.format.convert_result_to_trec --input ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.txt \
--output ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.trec \
--remove_query
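A TREC run file has six whitespace-separated columns per line, which is the format the evaluation tooling below expects:
# <query_id> Q0 <doc_id> <rank> <score> <run_tag>
head -n 3 ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.trec
Next, index and search the sparse representations: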
# Build inverted index
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input ${OUTPUT_DIR}/beir/${DATASET}/sparse \
--index ${OUTPUT_DIR}/beir/${DATASET}/sparse/index \
--generator DefaultLuceneDocumentGenerator \
--threads 16 \
--impact --pretokenized
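For reference, JsonVectorCollection expects JSON documents carrying a document id and a token-to-weight map; a sketch of the format (the exact fields written by encode.py may differ):
# {"id": "<doc_id>", "contents": "", "vector": {"<token>": <weight>, ...}}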
# search
python -m pyserini.search.lucene \
--index ${OUTPUT_DIR}/beir/${DATASET}/sparse/index \
--topics ${OUTPUT_DIR}/beir/${DATASET}/sparse/query.tsv \
--output ${OUTPUT_DIR}/beir/${DATASET}/results/rank.sparse.trec \
--output-format trec \
--batch 32 --threads 16 \
--hits 1000 \
--impact --pretokenized --remove-query
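The --impact and --pretokenized flags tell Pyserini to score documents with the stored term weights and to skip its own tokenization, matching how the sparse representations were encoded, while --remove-query drops any retrieved document whose id matches the query id. Finally, combine the dense and sparse runs into a hybrid ranking, weighted by --alpha: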
python hybrid.py \
--run_1 ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.trec \
--run_2 ${OUTPUT_DIR}/beir/${DATASET}/results/rank.sparse.trec \
--alpha 0.5 \
--save_path ${OUTPUT_DIR}/beir/${DATASET}/results/rank.hybrid.trec
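With --alpha 0.5 the two runs contribute equally. As a sketch (assuming hybrid.py linearly interpolates the two runs' scores, so a larger --alpha weights run_1 more heavily), you can sweep the fusion weight; the a${a} file names are just for illustration:
for a in 0.3 0.5 0.7
do
python hybrid.py \
--run_1 ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.trec \
--run_2 ${OUTPUT_DIR}/beir/${DATASET}/results/rank.sparse.trec \
--alpha ${a} \
--save_path ${OUTPUT_DIR}/beir/${DATASET}/results/rank.hybrid.a${a}.trec
done
Evaluate each run with trec_eval via Pyserini: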
# Dense results
python -m pyserini.eval.trec_eval -c -m recall.100,1000 -m ndcg_cut.10 beir-v1.0.0-${DATASET}-test ${OUTPUT_DIR}/beir/${DATASET}/results/rank.dense.trec
# Sparse results
python -m pyserini.eval.trec_eval -c -m recall.100,1000 -m ndcg_cut.10 beir-v1.0.0-${DATASET}-test ${OUTPUT_DIR}/beir/${DATASET}/results/rank.sparse.trec
# Hybrid results
python -m pyserini.eval.trec_eval -c -m recall.100,1000 -m ndcg_cut.10 beir-v1.0.0-${DATASET}-test ${OUTPUT_DIR}/beir/${DATASET}/results/rank.hybrid.trec
You should get the following results:
Dense results:
recall_100 all 0.2617
recall_1000 all 0.5531
ndcg_cut_10 all 0.2780
Sparse results:
recall_100 all 0.2410
recall_1000 all 0.4415
ndcg_cut_10 all 0.2938
Hybrid results:
recall_100 all 0.2853
recall_1000 all 0.5678
ndcg_cut_10 all 0.3325
If you use our code in your research, please consider citing our paper :)
@misc{zhuang2024promptreps,
      title={PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval},
      author={Shengyao Zhuang and Xueguang Ma and Bevan Koopman and Jimmy Lin and Guido Zuccon},
      year={2024},
      eprint={2404.18424},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}