IR2: Information Regularization for Synthetic Query Generation of Information Retrieval Tasks (LREC-COLING 2024)
Jianyou (Andre) Wang* Kaicheng Wang* Xiaoyue Wang* Weili Cao Ramamohan Paturi+ Leon Bergen+
Laboratory for Emerging Intelligence (LEI)
CSE Department, University of California, San Diego
La Jolla, CA 92093
Our paper has been accepted to LREC-COLING 2024. The arXiv version is available here.
Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces a method of Information Regularization for synthetic query generation aimed at improving data augmentation techniques and consequently, IR systems, by preventing models from learning superficial features of queries. Our approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline—input, prompt, and output—each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios.
We highly recommend creating a new conda environment for the following steps:

```bash
conda create -n info_reg python=3.10
conda activate info_reg
```
To start, please run

```bash
bash setup.sh
```

Running the above line will create a new conda environment named `info_reg`, download the needed packages and datasets, and unzip the needed files.
The generated queries, produced using the data from `evaluation/dataset/` and the prompts shown in `prompt/`, are stored in `generation/`.
To generate results from GPT:

- **Set up your API Key and OpenAI model:**
  - Open the `gpt_generation.py` file.
  - Locate the line `openai.api_key = None` and replace `None` with your OpenAI API key.
  - Our default model is `"gpt-4-0613"`; change this according to your requirements.
- **Configure the Shell Script:**
  - Open the `run_generation.sh` script in a text editor.
  - Set the following variables according to your requirements:
    - `thread_num`: number of threads, usually `20`.
    - `prompt_path`: path to the prompt file; this should be a pickle that stores a dictionary mapping id strings to prompt strings.
    - `name`: prefix of the name for the output file, e.g., `"arguana_Dreg_40%"`.
    - `start`: starting index for processing, usually `0`.
    - `end`: ending index for processing; this depends on the length of the prompt file.
- **Run the Generation Script:**
  - Execute the `run_generation.sh` script to start the generation process.
  - We recommend using `nohup` to run the generation process (remember to replace `<LOG_FILE>` with the name of the log file):

    ```bash
    nohup bash run_generation.sh > generation_log/<LOG_FILE> 2>&1 &
    ```
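For reference, the `prompt_path` file described above is a pickle containing a dictionary from id strings to prompt strings. A minimal sketch of assembling such a file (the ids, prompt text, and output filename below are made up for illustration):

```python
import pickle

# Hypothetical prompt dictionary: each entry maps an id string to the
# prompt string that will be sent to the model for that item.
prompts = {
    "doc_0": "Given the passage below, write the query it would answer: ...",
    "doc_1": "Given the passage below, write the query it would answer: ...",
}

# Save in the pickle format expected by run_generation.sh's prompt_path.
with open("example_prompts.pickle", "wb") as f:
    pickle.dump(prompts, f)

# Sanity check: reload and confirm the round trip. In run_generation.sh,
# start would typically be 0 and end would be len(prompts).
with open("example_prompts.pickle", "rb") as f:
    loaded = pickle.load(f)
```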
To train an embedding model using synthetically generated data, run the following:

```bash
python3 -u run_training_evaluation.py -query_type <QUERY_TYPE> -num_experiment <NUM_EXPERIMENTS> -name <EXPERIMENT_NAME> -query_num <QUERY_NUMBER> -half <WHETHER_FREEZE_HALF> -shuffle <SHUFFLE> -cuda <CUDA> -margin <MARGIN> -batch_size <TRAINING_BATCHSIZE> -model_name <MODEL> -evaluation <WHETHER_RUN_EVALUATION>
```

- `<QUERY_TYPE>`: type of query (e.g., `arguana_Dreg_80%.pickle`).
- `<NUM_EXPERIMENTS>`: number of experiments to run.
- `<EXPERIMENT_NAME>`: name for the experiment.
- `<QUERY_NUMBER>`: number of queries to process.
- `<WHETHER_FREEZE_HALF>`: set to `True` or `False` depending on whether to train only half of the layers.
- `<SHUFFLE>`: set to `True` or `False` to shuffle the data.
- `<CUDA>`: the CUDA device used to train the model.
- `<MARGIN>`: margin of the contrastive training loss.
- `<TRAINING_BATCHSIZE>`: training batch size.
- `<MODEL>`: the model to train/evaluate.
- `<WHETHER_RUN_EVALUATION>`: set to `True` or `False` to decide whether to run evaluation.
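To give intuition for the `-margin` argument: in margin-based contrastive training, the positive document's similarity to the query must exceed the negative's by at least the margin before the loss reaches zero. The sketch below is a generic illustration of this idea, not the repository's actual loss code:

```python
import numpy as np

def triplet_margin_loss(query, pos, neg, margin=0.5):
    """Illustrative margin loss over cosine similarities: penalize the
    triplet unless sim(query, pos) >= sim(query, neg) + margin."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(query, pos) + cos(query, neg))

# Toy 2-D embeddings: the positive is close to the query, the negative is not.
q = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # relevant document embedding
n = np.array([0.0, 1.0])   # irrelevant document embedding
easy_loss = triplet_margin_loss(q, p, n, margin=0.5)   # margin satisfied
hard_loss = triplet_margin_loss(q, p, np.array([0.95, 0.05]), margin=0.5)
```

A larger margin forces a wider similarity gap between positives and negatives, which typically makes training stricter.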
To train an embedding model using contrastive finetuning on the dataset for domain adaptation, run the following:

```bash
python3 -u run_contrastive_finetuning.py -query_type <QUERY_TYPE> -num_experiment <NUM_EXPERIMENTS> -name <EXPERIMENT_NAME> -query_num <QUERY_NUMBER> -half <WHETHER_FREEZE_HALF> -shuffle <SHUFFLE> -cuda <CUDA> -margin <MARGIN> -batch_size <TRAINING_BATCHSIZE> -model_name <MODEL> -sent <WHETHER_SENT>
```

Most of the arguments are the same as the previous ones, except for:

- `<WHETHER_SENT>`: set to `True` or `False` to decide whether to run contrastive finetuning with individual sentences or the whole paragraph.
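As a rough illustration of the sentence-versus-paragraph distinction behind `<WHETHER_SENT>`, a paragraph can either be kept whole or split into sentence-level training units. The splitting rule below is an assumption for illustration, not the repository's implementation:

```python
import re

def to_units(paragraph, use_sentences):
    """Return sentence-level units when use_sentences is True,
    otherwise the whole paragraph as a single unit."""
    if not use_sentences:
        return [paragraph]
    # Naive sentence split on ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

para = "Dense retrieval works well. It needs training data. Synthetic queries help."
sent_units = to_units(para, True)    # three sentence units
para_units = to_units(para, False)   # one paragraph unit
```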
To train an embedding model using Promptagator's queries and consistency filtering, run the following:

```bash
python3 -u run_promptagator.py -query_type <QUERY_TYPE> -num_experiment <NUM_EXPERIMENTS> -name <EXPERIMENT_NAME> -query_num <QUERY_NUMBER> -half <WHETHER_FREEZE_HALF> -shuffle <SHUFFLE> -cuda <CUDA> -margin <MARGIN> -batch_size <TRAINING_BATCHSIZE> -model_name <MODEL>
```

Replace each `<PLACEHOLDER>` with the appropriate value for your experiment.
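Consistency filtering in the Promptagator style keeps a synthetic query only if a retriever ranks the query's own source document near the top. The sketch below illustrates this round-trip check with toy embeddings; the scoring model and cutoff are assumptions, not the repository's code:

```python
import numpy as np

def consistency_filter(query_embs, doc_embs, source_ids, top_k=1):
    """Keep query i only if its source document source_ids[i] appears
    among the top_k documents retrieved for that query."""
    sims = query_embs @ doc_embs.T          # similarity of each query to every doc
    kept = []
    for i, src in enumerate(source_ids):
        ranking = np.argsort(-sims[i])      # doc indices, best match first
        if src in ranking[:top_k]:
            kept.append(i)
    return kept

# Toy 2-D embeddings: query 0 matches its source doc 0;
# query 1 drifts toward doc 0, so its source doc 1 is not retrieved first.
queries = np.array([[1.0, 0.0], [0.9, 0.1]])
docs = np.array([[1.0, 0.0], [0.0, 1.0]])
kept = consistency_filter(queries, docs, source_ids=[0, 1], top_k=1)
```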
The resulting checkpoint will be stored in `training/model_checkpoints`.