This is the original repository for the paper Open-Domain Conversational Search Assistant with Transformers available here.
Install/Clone Anserini (Java) following this link.
- You will also need Java 11
Install the rest of the necessary dependencies with the package manager of your choice. If you use conda you can create the env using the search_assistant_env.yml file (depending on the hardware some versions might need to be different):
conda env create -f search_assistant_env.yml
Or you can manually install the necessary packages:
- pandas: pip install pandas or conda install pandas
- numpy: pip install numpy or conda install numpy
- cbor: pip install cbor
- torch: pip install torch torchvision torchaudio or conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
- transformers: pip install transformers or conda install -c huggingface transformers
- faiss-cpu: pip install faiss-cpu or conda install faiss-cpu -c pytorch
- pyserini: pip install pyserini or conda version
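Once the packages are installed, a quick import check can confirm that the environment is usable. The sketch below is not part of the repository: the helper name `missing_modules` is hypothetical, and the import names are assumed to match the packages listed above (note that `faiss-cpu` is imported as `faiss`).

```python
import importlib.util

# Import names for the packages listed above (faiss-cpu is imported as "faiss").
REQUIRED_MODULES = ["pandas", "numpy", "cbor", "torch", "transformers", "faiss", "pyserini"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

The check only verifies that each module can be located; it does not verify versions or GPU support.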
The Topics and Resolved Topic Annotations from TREC CAsT 2019 are already provided in the folder 2019_data.
To get the documents to index, follow the instructions at the original location of the data here, in the Year 1 (TREC 2019) part. The download links for the raw data are also listed below, but the original link gives a more comprehensive explanation.
- Link to MS MARCO and MS MARCO duplicate files
- Link to TREC CAR
- Link to WAPO and WAPO duplicate files
Instead of preprocessing and indexing the data yourself, you can also download our anserini/pyserini index from here (~30 GB).
After downloading the data we parse all sources to the same jsonl representation (it also removes the duplicates):
python3 convert_collections_to_jsonl.py --collection "MARCO" --collection_path <.../collection.tsv> --output_folder <path_to_output_folder> --duplicates_file <path to duplicates file>/duplicate_list_v1.0_MARCO.txt
python3 convert_collections_to_jsonl.py --collection "CAR" --collection_path <.../paragraphCorpus/dedup.articles-paragraphs.cbor> --output_folder <path_to_output_folder>
python3 convert_collections_to_jsonl.py --collection "WAPO" --collection_path <.../wapo/WashingtonPost.v2/data/TREC_Washington_Post_collection.v2.jl> --output_folder <path_to_output_folder> --duplicates_file <path to duplicates file>/wapo_duplicate_list_v1.0.txt
This process can take some time. With the default parameters, you should end up with a folder containing 51 json files.
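For a tab-separated collection such as MS MARCO, the conversion can be pictured roughly as follows. This is a minimal sketch, not the code in convert_collections_to_jsonl.py: the function name, the `MARCO_` id prefix, and passing the duplicates as an in-memory set are assumptions made for illustration. The `id` and `contents` fields are the ones Anserini's JsonCollection expects.

```python
import io
import json


def tsv_to_jsonl(tsv_lines, out_file, duplicate_ids=frozenset(), prefix="MARCO_"):
    """Write one JSON object per passage, skipping ids listed as duplicates.

    `tsv_lines` yields "<passage id>\t<passage text>" lines. The "MARCO_" prefix
    and the in-memory `duplicate_ids` set are assumptions made for this sketch.
    """
    written = 0
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        passage_id, text = line.split("\t", 1)
        if passage_id in duplicate_ids:
            continue  # drop passages flagged in the duplicates file
        doc = {"id": prefix + passage_id, "contents": text}
        out_file.write(json.dumps(doc) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    sample = ["0\tFirst passage.\n", "1\tA near-duplicate passage.\n", "2\tThird passage.\n"]
    buffer = io.StringIO()
    count = tsv_to_jsonl(sample, buffer, duplicate_ids={"1"})
    print(count)
    print(buffer.getvalue(), end="")
```

Writing every source into the same two-field layout is what allows a single index to cover all three collections.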
Then we use Anserini to index and search the data. To index the data created in the previous step run only one of the commands below:
Index using Anserini:
sh ./anserini/target/appassembler/bin/IndexCollection -collection JsonCollection -generator LuceneDocumentGenerator -threads 2 -input <path to jsonl files folder> -index <.../index_output_location>/car_marco_wapo -storePositions -storeDocvectors -storeRaw -stemmer krovetz
Index using pyserini:
python -m pyserini.index --input <path to jsonl files folder> --collection JsonCollection --generator LuceneDocumentGenerator --index <.../index_output_location>/car_marco_wapo --threads 2 --storePositions --storeDocvectors --storeRaw --stemmer krovetz
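If you prefer to launch indexing from Python, the pyserini command above can be assembled as an argument list. The sketch below uses only the flags shown in that command; the helper names `build_index_command` and `run_indexing` are hypothetical, and the paths are placeholders you need to fill in.

```python
import subprocess
import sys


def build_index_command(jsonl_dir, index_dir, threads=2):
    """Assemble the pyserini indexing command shown above as an argument list."""
    return [
        sys.executable, "-m", "pyserini.index",
        "--input", str(jsonl_dir),
        "--collection", "JsonCollection",
        "--generator", "LuceneDocumentGenerator",
        "--index", str(index_dir),
        "--threads", str(threads),
        "--storePositions", "--storeDocvectors", "--storeRaw",
        "--stemmer", "krovetz",
    ]


def run_indexing(jsonl_dir, index_dir, threads=2):
    """Run the indexer and raise if it exits with a non-zero status."""
    subprocess.run(build_index_command(jsonl_dir, index_dir, threads), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the paths contain spaces.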
The query rewriting model in the paper is based on a T5 model trained on the CANARD dataset.
The model trained on CANARD is available to download here. Instructions on how to load and use the model are provided in the python notebook available at: colab_notebooks/t5_query_rewriter.ipynb.
If you are only interested in the queries already processed by the query rewriting model you can skip the training of the model and use the queries available at:
- coreferenced_resolved_files/trec_cast_complete_t5_real_time_v1.json
- coreferenced_resolved_files/trec_cast_complete_t5_real_time_v2.json
The preprocessed data to train the T5 model is already available in the folder T5_training_data. If you want to create the data yourself, start by downloading the CANARD dataset and unzipping it to a folder.
After this run the following commands to obtain the data to train the model (the V1 and V2 formats are explained in the paper):
- Using V1 format:
python3 generate_data_for_t5.py --collection CANARD --input_file <.../CANARD_Release>/train.json --output_file ./t5_training_data/training_t5_canard_data.tsv --version 1
python3 generate_data_for_t5.py --collection CANARD --input_file <.../CANARD_Release>/dev.json --output_file ./t5_training_data/validation_t5_canard_data.tsv --version 1
python3 generate_data_for_t5.py --collection CANARD --input_file <.../CANARD_Release>/test.json --output_file ./t5_training_data/test_t5_canard_data.tsv --version 1
- Using V2 format:
python3 generate_data_for_t5.py --collection CANARD --input_file <.../CANARD_Release>/train.json --output_file ./t5_training_data/training_t5_canard_data_v2.tsv --version 2
python3 generate_data_for_t5.py --collection CANARD --input_file <.../CANARD_Release>/dev.json --output_file ./t5_training_data/validation_t5_canard_data_v2.tsv --version 2
python3 generate_data_for_t5.py --collection CANARD --input_file <.../CANARD_Release>/test.json --output_file ./t5_training_data/test_t5_canard_data_v2.tsv --version 2
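The six commands above differ only in the split and the version, so they can also be generated in a loop. This is a sketch under stated assumptions: the helper name `t5_data_commands` and the `canard_dir` argument are hypothetical, while the flags and output file names reproduce the ones used above.

```python
# Maps each CANARD split file to the prefix used in the output file names above.
SPLITS = {"train": "training", "dev": "validation", "test": "test"}


def t5_data_commands(canard_dir, output_dir="./t5_training_data"):
    """Return the six generate_data_for_t5.py invocations (3 splits x 2 versions)."""
    commands = []
    for version in (1, 2):
        suffix = "" if version == 1 else "_v2"
        for split, prefix in SPLITS.items():
            commands.append([
                "python3", "generate_data_for_t5.py",
                "--collection", "CANARD",
                "--input_file", f"{canard_dir}/{split}.json",
                "--output_file", f"{output_dir}/{prefix}_t5_canard_data{suffix}.tsv",
                "--version", str(version),
            ])
    return commands


if __name__ == "__main__":
    for command in t5_data_commands("CANARD_Release"):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the repository root.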
To generate the input for the T5 model using the TREC CAsT dataset (only for evaluation purposes):
python3 generate_data_for_t5.py --collection CAST --input_file ./2019_data/evaluation_topics_v1.0.json --output_file ./t5_training_data/trec_cast_evaluation.tsv --resolved_file ./2019_data/evaluation_topics_annotated_resolved_v1.0.tsv
To train the model load the python notebook available at: colab_notebooks/t5_query_rewriter.ipynb into Google Colab and follow the steps detailed there.
You will need a Google cloud storage account, and you can use the free credits provided by Google to train the model on TPUs.
Create a bucket on your account, load the training data to that bucket and put the bucket name in the notebook in the places where it is needed.
The colab notebook is a modified version of the original notebook provided by the T5 creators. Follow the instructions on the original repository if in need of any additional information.
At the end of the notebook execution, you should have a trained T5 model for query rewriting.
After this just feed the model the queries from TREC CAsT and gather the outputs to use in the retrieval and reranking steps.
After having the queries generated by the T5 model, we can now evaluate the proposed architecture composed of the transformer query rewriter, the retrieval model, and the transformer reranker on the TREC CAsT 2019 dataset.
As explained in the paper the query rewriter is a T5 model, the retrieval model is LMD, and the transformer reranker is a BERT Model trained on MS MARCO from here.
The TREC CAsT 2019 runs for the evaluation set are already provided in the runs folder. However, you can also create them by following the rest of this section.
To generate the retrieval only runs for all query types run:
python3 run_test_generalizable.py --topics_json_path ./2019_data/evaluation_topics_v1.0.json --qrel_file_path ./2019_data/evaluation_topics_mod.qrel --similarity lmd --index <.../index_output_location>/car_marco_wapo
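For reference, LMD stands for a query-likelihood language model with Dirichlet smoothing. The sketch below shows the textbook scoring formula on toy data. It is not Anserini's implementation, which differs in details; the function name and the value `mu=1000` are illustrative assumptions.

```python
import math
from collections import Counter


def lmd_score(query_terms, doc_terms, collection_tf, collection_len, mu=1000.0):
    """Dirichlet-smoothed query likelihood: sum of log p(t | d) over query terms.

    p(t | d) = (tf(t, d) + mu * p(t | C)) / (|d| + mu), where p(t | C) is the
    term's relative frequency in the whole collection.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # term unseen in the collection: ignored in this sketch
        score += math.log((doc_tf[term] + mu * p_collection) / (doc_len + mu))
    return score


if __name__ == "__main__":
    docs = {
        "d1": "neural ranking models for passage retrieval".split(),
        "d2": "cooking pasta with tomato sauce".split(),
    }
    collection_tf = Counter(term for terms in docs.values() for term in terms)
    collection_len = sum(collection_tf.values())
    query = "passage retrieval".split()
    ranked = sorted(
        docs,
        key=lambda d: lmd_score(query, docs[d], collection_tf, collection_len),
        reverse=True,
    )
    print(ranked)
```

The smoothing parameter `mu` controls how strongly short documents are pulled toward the collection statistics.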
To generate the retrieval and reranking runs for all query types run (it is highly recommended to use a GPU device to create this run):
python3 run_test_generalizable.py --topics_json_path ./2019_data/evaluation_topics_v1.0.json --qrel_file_path ./2019_data/evaluation_topics_mod.qrel --similarity lmd --index <.../index_output_location>/car_marco_wapo --reranker --reranker_batch_size 8
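The reranking stage re-scores the retrieved passages in batches, which is what the --reranker_batch_size argument controls. The sketch below shows only that control flow, with a toy word-overlap scorer standing in for the BERT model. The names `rerank`, `score_batch`, and `overlap_scorer` are hypothetical and are not the repository's API.

```python
def rerank(query, passages, score_batch, batch_size=8):
    """Re-order `passages` by the scores returned from `score_batch`.

    `score_batch(query, batch)` must return one float per passage in `batch`;
    in the real pipeline this role is played by the BERT reranker.
    """
    scores = []
    for start in range(0, len(passages), batch_size):
        batch = passages[start:start + batch_size]
        scores.extend(score_batch(query, batch))
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]


def overlap_scorer(query, batch):
    """Toy stand-in scorer: number of query words that appear in each passage."""
    query_words = set(query.lower().split())
    return [float(len(query_words & set(p.lower().split()))) for p in batch]
```

A smaller batch size lowers peak GPU memory use at the cost of more forward passes.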
The run_test_generalizable.py script will generate various .run files in the runs folder, and various .csv files in the results folder with the unofficial metrics.
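For readers unfamiliar with run files, trec_eval reads the standard six-column TREC run format: query id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below writes that format. The function name, the example ids, and the run tag are illustrative assumptions, not values taken from the repository.

```python
import io


def write_trec_run(results, out_file, run_tag="example_run"):
    """Write `results` ({query_id: [(doc_id, score), ...]}) as TREC run lines.

    Each line has six columns: query id, "Q0", doc id, rank, score, run tag.
    """
    for query_id, scored_docs in results.items():
        ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            out_file.write(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")


if __name__ == "__main__":
    buffer = io.StringIO()
    write_trec_run({"31_1": [("MARCO_10", 7.5), ("CAR_abc", 9.25)]}, buffer)
    print(buffer.getvalue(), end="")
```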
The metrics produced by the run_test_generalizable.py script are not the official ones, so we now run the official trec_eval script over the generated runs. We provide a version of trec_eval, but you can also download a newer version from here if needed.
If you download a newer version of trec_eval, you may need to change the following files in the official script to include the metrics at rank 3:
- trec_eval.9.0.4/m_map_cut.c
- trec_eval.9.0.4/m_P.c
- trec_eval.9.0.4/m_rel_P.c
- trec_eval.9.0.4/m_ndcg_cut.c
Just add the desired ranks to the long_cutoff_array variable, e.g.:
static long long_cutoff_array[] = {1, 3, 5, 10, 15, 20, 30, 100, 200, 500, 1000};
Run the command to compile again inside the trec_eval directory:
cd trec_eval.9.0.4 && make
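To make "metrics at rank 3" concrete, the sketch below computes P@3 and nDCG@3 for a single query. It uses the common formulation with linear gain and a log2 discount and is meant for illustration only; the function names are hypothetical, and the official numbers should always come from trec_eval.

```python
import math


def precision_at_k(ranked_rels, k):
    """Fraction of the top-k results with a relevance grade above zero."""
    return sum(1 for rel in ranked_rels[:k] if rel > 0) / k


def dcg_at_k(rels, k):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(ranked_rels, judged_rels, k):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.

    `ranked_rels` holds the grades of the returned documents in rank order;
    `judged_rels` holds the grades of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([2, 0, 1], [2, 1, 1, 0], 3)` compares a ranking that places a non-relevant passage second against the ideal ordering of the judged passages.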
Finally, run this command with every .run file generated by run_test_generalizable.py to get the official results which will be written to the results folder:
python3 run_trec_eval_official_metrics.py --run_name <path to trec run file> --out_file_name <name of output file> --trec_eval_location ./trec_eval.9.0.4/trec_eval --path_to_qrels ./2019_data/evaluation_topics_mod.qrel
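Since the script has to be run once per .run file, a small loop can build all the invocations. This is a sketch: the helper name `official_eval_commands` is hypothetical, and naming each output after the run file's stem is an assumption. The flags and default paths are the ones from the command above.

```python
from pathlib import Path


def official_eval_commands(runs_dir="./runs",
                           trec_eval="./trec_eval.9.0.4/trec_eval",
                           qrels="./2019_data/evaluation_topics_mod.qrel"):
    """Return one run_trec_eval_official_metrics.py command per .run file."""
    commands = []
    for run_file in sorted(Path(runs_dir).glob("*.run")):
        commands.append([
            "python3", "run_trec_eval_official_metrics.py",
            "--run_name", str(run_file),
            "--out_file_name", run_file.stem,  # assumption: name outputs after the run
            "--trec_eval_location", trec_eval,
            "--path_to_qrels", qrels,
        ])
    return commands


if __name__ == "__main__":
    for command in official_eval_commands():
        print(" ".join(command))
```

Each command can then be executed with `subprocess.run(command, check=True)`.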
If you find anything useful please cite our work using:
@inproceedings{DBLP:conf/ecir/FerreiraLSM21,
author = {Rafael Ferreira and
Mariana Leite and
David Semedo and
Jo{\~{a}}o Magalh{\~{a}}es},
editor = {Djoerd Hiemstra and
Marie{-}Francine Moens and
Josiane Mothe and
Raffaele Perego and
Martin Potthast and
Fabrizio Sebastiani},
title = {Open-Domain Conversational Search Assistant with Transformers},
booktitle = {Advances in Information Retrieval - 43rd European Conference on {IR}
Research, {ECIR} 2021, Virtual Event, March 28 - April 1, 2021, Proceedings,
Part {I}},
series = {Lecture Notes in Computer Science},
volume = {12656},
pages = {130--145},
publisher = {Springer},
year = {2021},
url = {https://doi.org/10.1007/978-3-030-72113-8\_9},
doi = {10.1007/978-3-030-72113-8\_9},
timestamp = {Wed, 07 Apr 2021 16:01:38 +0200},
biburl = {https://dblp.org/rec/conf/ecir/FerreiraLSM21.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}