
This is a temporary repository of our paper "Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval"


kyriemao/COTED


COTED: Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval

This is a temporary anonymous repository of the paper "Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval".


Prerequisites

Install dependencies:

pip install -r requirements.txt

Data

We provide the raw and preprocessed versions of the two CAsT datasets (CAsT-19 and CAsT-20) in the datasets folder. The human annotation data is in the annotation_data folder. Note that although the original CAsT-20 dataset includes some turn dependency annotations, we found them to be neither accurate nor sufficient, so our team refined the original annotations.

Main Files

  • train.py: curriculum_sampling, two-step multi-task learning
  • test.py: test with Faiss
  • my_utils.py: useful functions
  • models.py: CQE model architecture (i.e., ANCE)
  • db_lib.py: data structures, conversational data augmentation, curriculum_sampling
  • running scripts:
    • train_cast19.sh
    • test_cast19.sh
    • train_cast20.sh
    • test_cast20.sh
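
As an illustration of the retrieval step that test.py performs with Faiss, the sketch below ranks passages by inner product between a query embedding and passage embeddings. It is a pure-Python stand-in for an exact inner-product search, not the repository's code: the function names and the toy vectors are assumptions made for this example, and real embeddings would come from the CQE model in models.py.

```python
import heapq
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k (passage_index, score) pairs with the highest inner product."""
    scored = (
        (pid, inner_product(query_vec, vec))
        for pid, vec in enumerate(passage_vecs)
    )
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_passages = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_query = [1.0, 0.5]

if __name__ == "__main__":
    print(top_k_passages(toy_query, toy_passages, k=2))
```

A brute-force scan like this is only practical for tiny collections; for the full passage collection an index library such as Faiss is the appropriate tool.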

Training

First, download the public pre-trained ANCE checkpoints to the checkpoints folder.

# Download and unpack both checkpoints inside the checkpoints folder
mkdir -p checkpoints
cd checkpoints
wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
wget https://data.thunlp.org/convdr/ad-hoc-ance-orquac.cp
unzip Passage_ANCE_FirstP_Checkpoint.zip
mv "Passage ANCE(FirstP) Checkpoint" ad-hoc-ance-msmarco
cd ..

To train COTED, run the following scripts.

# params: training_epoch, aug_ratio, loss_weight

# CAsT-19
bash train_cast19.sh 6 2 0.01

# CAsT-20
bash train_cast20.sh 6 3 0.01
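
When sweeping over hyperparameters, it may be convenient to launch the training scripts from Python. The following is a minimal sketch, assuming the scripts take the three positional arguments listed above (training_epoch, aug_ratio, loss_weight) in that order. The helper names build_train_command and run_training are hypothetical and not part of the repository, and treating aug_ratio as an integer is an assumption based on the example values.

```python
import subprocess
from typing import List

# Dataset tags matching the provided scripts train_cast19.sh and train_cast20.sh.
SUPPORTED_DATASETS = ("cast19", "cast20")


def build_train_command(
    dataset: str, training_epoch: int, aug_ratio: int, loss_weight: float
) -> List[str]:
    """Build the argument list for one training run (hypothetical helper)."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return [
        "bash",
        f"train_{dataset}.sh",
        str(training_epoch),
        str(aug_ratio),
        str(loss_weight),
    ]


def run_training(
    dataset: str,
    training_epoch: int,
    aug_ratio: int,
    loss_weight: float,
    dry_run: bool = True,
) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = build_train_command(dataset, training_epoch, aug_ratio, loss_weight)
    if not dry_run:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: only prints the commands for the two settings shown above.
    print(" ".join(run_training("cast19", 6, 2, 0.01)))
    print(" ".join(run_training("cast20", 6, 3, 0.01)))
```

The dry-run default keeps the sketch safe to execute outside the repository; pass dry_run=False from the repository root to actually start training.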

Testing

Before testing, generate the passage embeddings by running:

python gen_tokenized_doc.py --config=gen_tokenized_doc.toml

python gen_doc_embedding.py --config=gen_doc_embedding.toml

The passage embeddings are expected to be stored at ./datasets/collections/cast_shared/passage_embeddings.

Then, run the following scripts for testing.

# param: test_epoch

# CAsT-19
bash test_cast19.sh 6
# or
bash test_cast19.sh final

# CAsT-20
bash test_cast20.sh 6
# or
bash test_cast20.sh final
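
Because the test scripts depend on the passage embeddings being in place, a quick check before launching them can save a failed run. The sketch below only verifies that the expected directory exists and is non-empty. The function name embeddings_ready is hypothetical, and the check makes no assumption about the file names or format that gen_doc_embedding.py produces.

```python
from pathlib import Path
from typing import Union

# Location stated above for the generated passage embeddings.
EMBEDDING_DIR = Path("./datasets/collections/cast_shared/passage_embeddings")


def embeddings_ready(embedding_dir: Union[str, Path] = EMBEDDING_DIR) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    path = Path(embedding_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if embeddings_ready():
        print("Passage embeddings found; the test scripts can be run.")
    else:
        print(f"No passage embeddings found in {EMBEDDING_DIR}; generate them first.")
```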
