48 changes: 48 additions & 0 deletions research/information_retrieval/DPR/README.md
@@ -0,0 +1,48 @@
# Compressing DPR
Author: @spacemanidol

## Methods
1. Varying models
2. Structured Pruning
3. Unstructured Pruning
4. Dimensionality Reduction
## Usage
Bi-encoder training hyperparameters:

```yaml
batch_size: 4
dev_batch_size: 16
adam_eps: 1e-8
adam_betas: (0.9, 0.999)
max_grad_norm: 2.0
log_batch_step: 1
train_rolling_loss_step: 100
weight_decay: 0.0
learning_rate: 2e-5

# Linear warmup over warmup_steps.
warmup_steps: 1237

# Number of update steps to accumulate before performing a backward/update pass.
gradient_accumulation_steps: 1

# Total number of training epochs to perform.
num_train_epochs: 40
eval_per_epoch: 1
hard_negatives: 1
other_negatives: 0
val_av_rank_hard_neg: 30
val_av_rank_other_neg: 30
val_av_rank_bsz: 128
val_av_rank_max_qs: 10000
```
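A hypothetical training invocation with these settings could look like the sketch below. The entry point (train_dense_encoder.py) and the nq_train/nq_dev dataset names come from the Hydra config groups described in conf/README.md; the output path is an illustrative placeholder.

```bash
# Sketch: fine-tune the DPR bi-encoder on NQ, overriding config values from the command line.
# nq_train and nq_dev are dataset names from the encoder_train_default datasets group;
# output_dir is an illustrative placeholder.
python train_dense_encoder.py \
  train_datasets=[nq_train] \
  dev_datasets=[nq_dev] \
  output_dir=outputs/dpr_nq_biencoder
```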

Data downloads:
- https://www.dropbox.com/s/lvvpsx0cjk4vemv/collection.tar.gz?dl=1
- https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv?dl=1
- https://www.dropbox.com/s/khsplt2fhqwjs0v/qrels.dev.small.tsv?dl=1
- https://www.dropbox.com/s/uzkvv4gpj3a596a/predicted_queries_topk_sampling.zip?dl=1
- https://www.dropbox.com/s/nc1drdkjpxxsngg/run.dev.small.tsv?dl=1
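The files above can be fetched directly, for example with wget (the output filenames below simply mirror the originals):

```bash
# Sketch: download the data listed above and unpack the archives.
wget -O collection.tar.gz "https://www.dropbox.com/s/lvvpsx0cjk4vemv/collection.tar.gz?dl=1"
wget -O queries.dev.small.tsv "https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv?dl=1"
wget -O qrels.dev.small.tsv "https://www.dropbox.com/s/khsplt2fhqwjs0v/qrels.dev.small.tsv?dl=1"
wget -O predicted_queries_topk_sampling.zip "https://www.dropbox.com/s/uzkvv4gpj3a596a/predicted_queries_topk_sampling.zip?dl=1"
wget -O run.dev.small.tsv "https://www.dropbox.com/s/nc1drdkjpxxsngg/run.dev.small.tsv?dl=1"
tar -xzvf collection.tar.gz
unzip predicted_queries_topk_sampling.zip
```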
## Results

| Top-k passages | Original DPR NQ model | New DPR model |
| ------------- |:-------------:| -----:|
| 1 | 45.87 | 52.47 |
| 5 | 68.14 | 72.24 |
| 20 | 79.97 | 81.33 |
| 100 | 85.87 | 87.29 |
### requirements.txt
65 changes: 65 additions & 0 deletions research/information_retrieval/DPR/conf/README.md
@@ -0,0 +1,65 @@
## Hydra

[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
framework that simplifies the development of research and other complex
applications. The key feature is the ability to dynamically create a
hierarchical configuration by composition and override it through config files
and the command line.

## DPR configuration
All DPR tools' configuration parameters are now split between different config groups, and you can either modify them in the config files or override them from the command line.

Each tool's main method (in train_dense_encoder.py, generate_dense_embeddings.py, dense_retriever.py and train_reader.py) now has a Hydra @hydra.main decorator that names its configuration file in the conf/ dir.
For example, dense_retriever.py takes all its parameters from the conf/dense_retriever.yaml file.
Every tool's configuration file refers to other configuration files via the "defaults:" parameter.
This is called a [configuration group](https://hydra.cc/docs/tutorials/structured_config/config_groups) in Hydra.

Let's take a look at dense_retriever.py's configuration:


```yaml

defaults:
- encoder: hf_bert
- datasets: retriever_default
- ctx_sources: default_sources

indexers:
flat:
_target_: dpr.indexer.faiss_indexers.DenseFlatIndexer

hnsw:
_target_: dpr.indexer.faiss_indexers.DenseHNSWFlatIndexer

hnsw_sq:
_target_: dpr.indexer.faiss_indexers.DenseHNSWSQIndexer

...
qa_dataset:
...
ctx_datatsets:
...
indexer: flat
...

```

" - encoder: " - a configuration group that contains all parameters to instantiate the encoder. The actual parameters are located in conf/encoder/hf_bert.yaml file.
If you want to override some of them, you can either
- Modify that config file
- Create a new config group file under conf/encoder/ folder and enable to use it by providing encoder={your file name} command line argument
- Override specific parameter from command line. For example: encoder.sequence_length=300
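A minimal sketch of the last two options (conf/encoder/my_encoder.yaml is a hypothetical file you would create yourself; other required dense_retriever.py parameters are omitted for brevity):

```bash
# Use a custom encoder config group file: conf/encoder/my_encoder.yaml (hypothetical).
python dense_retriever.py encoder=my_encoder

# Override a single encoder parameter in place.
python dense_retriever.py encoder.sequence_length=300
```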

" - datasets:" - a configuration group that contains a list of all possible sources of queries for evaluation. One can find them in conf/datasets/retriever_default.yaml file.
One should specify the dataset to use by providing qa_dataset parameter in order to use one of them during evaluation. For example, if you want to run the retriever on NQ test set, set qa_dataset=nq_test as a command line parameter.

It is now much easier to use custom datasets, without the need to convert them to the DPR format. Just define your own class that provides the relevant __getitem__(), __len__() and load_data() methods (inheriting from QASrc).

" - ctx_sources: " - a configuration group that contains a list of all possible passage sources. One can find them in conf/ctx_sources/default_sources.yaml file.
One should specify a list of names of the passages datasets as ctx_datatsets parameter. For example, if you want to use dpr's old wikipedia passages, set ctx_datatsets=[dpr_wiki].
Please note that this parameter is a list and you can effectively concatenate different passage source into one. In order to use multiple sources at once, one also needs to provide relevant embeddings files in encoded_ctx_files parameter, which is also a list.


"indexers:" - a parameters map that defines various indexes. The actual index is selected by indexer parameter which is 'flat' by default but you can use loss index types by setting indexer=hnsw or indexer=hnsw_sq in the command line.

Please refer to the comments in the configuration files for details on every parameter.
47 changes: 47 additions & 0 deletions research/information_retrieval/DPR/conf/biencoder_train_cfg.yaml
@@ -0,0 +1,47 @@

# configuration groups
defaults:
- encoder: hf_bert
- train: biencoder_default
- datasets: encoder_train_default

train_datasets:
dev_datasets:
output_dir:
train_sampling_rates:
loss_scale_factors:

# Whether to lower case the input text. Set True for uncased models, False for the cased ones.
do_lower_case: True

fix_ctx_encoder: False
val_av_rank_start_epoch: 30
seed: 12345
checkpoint_file_name: dpr_biencoder

# A trained bi-encoder checkpoint file to initialize the model
model_file:

# TODO: move to a conf group
# local_rank for distributed training on gpus
local_rank: -1
global_loss_buf_sz: 592000
device:
distributed_world_size:
distributed_port:
no_cuda: False
n_gpu:
fp16: True

# For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', 'O3'].
# See details at https://nvidia.github.io/apex/amp.html
fp16_opt_level: O1

# tokens which won't be split by the tokenizer
special_tokens:

ignore_checkpoint_offset: False
ignore_checkpoint_optimizer: False

# set to True to enable multiple query encoders
multi_q_encoder: False
@@ -0,0 +1,6 @@
# @package _group_

dpr_wiki:
_target_: dpr.data.retriever_data.CsvCtxSrc
file: data.wikipedia_split.psgs_w100
id_prefix: 'wiki:'
@@ -0,0 +1,46 @@
# @package _group_

nq_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.nq-train

nq_train_hn1:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.nq-adv-hn-train

nq_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.nq-dev

trivia_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.trivia-train

trivia_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.trivia-dev

squad1_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.squad1-train

squad1_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.squad1-dev

webq_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.webq-train

webq_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.webq-dev

curatedtrec_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.curatedtrec-train

curatedtrec_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.curatedtrec-dev

@@ -0,0 +1,33 @@
# @package _group_

nq_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.nq-test

nq_train:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.nq-train

nq_dev:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.nq-dev

trivia_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.trivia-test

trivia_train:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.trivia-train

trivia_dev:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.trivia-dev

webq_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.webq-test

curatedtrec_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.curatedtrec-test
71 changes: 71 additions & 0 deletions research/information_retrieval/DPR/conf/dense_retriever.yaml
@@ -0,0 +1,71 @@
defaults:
- encoder: hf_bert # defines encoder initialization parameters
- datasets: retriever_default # contains a list of all possible sources of queries for evaluation. Specific set is selected by qa_dataset parameter
- ctx_sources: default_sources # contains a list of all possible passage sources. Specific passages sources selected by ctx_datatsets parameter

indexers:
flat:
_target_: dpr.indexer.faiss_indexers.DenseFlatIndexer

hnsw:
_target_: dpr.indexer.faiss_indexers.DenseHNSWFlatIndexer

hnsw_sq:
_target_: dpr.indexer.faiss_indexers.DenseHNSWSQIndexer

# the name of the queries dataset from the 'datasets' config group
qa_dataset:

# a list of names of the passages datasets from the 'ctx_sources' config group
ctx_datatsets:

# Glob paths to encoded passages (from the generate_dense_embeddings tool)
encoded_ctx_files: []

out_file:
# "regex" or "string"
match: string
n_docs: 100
validation_workers: 16

# Batch size to generate query embeddings
batch_size: 128

# Whether to lower case the input text. Set True for uncased models, False for the cased ones.
do_lower_case: True

# The attribute name of encoder to use for queries. Options for the BiEncoder model: question_model, ctx_model
# question_model is used if this param is empty
encoder_path:

# path to the FAISS index location - only needed if you want to serialize the faiss index to files or read it from them
# (instead of using encoded_ctx_files)
# it should point to either a directory or a common index file prefix name
# if there is no index at the specified location, the index will be created from encoded_ctx_files
index_path:

kilt_out_file:

# A trained bi-encoder checkpoint file to initialize the model
model_file:

validate_as_tables: False
rpc_retriever_cfg_file:
indexer: flat

# tokens which won't be split by the tokenizer
special_tokens:

# TODO: move to a conf group
# local_rank for distributed training on gpus
local_rank: -1
global_loss_buf_sz: 150000
device:
distributed_world_size:
no_cuda: False
n_gpu:
fp16: False

# For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', 'O3'].
# See details at https://nvidia.github.io/apex/amp.html
fp16_opt_level: O1
24 changes: 24 additions & 0 deletions research/information_retrieval/DPR/conf/encoder/hf_bert.yaml
@@ -0,0 +1,24 @@
# @package _group_

# model type. One of [hf_bert, pytext_bert, fairseq_roberta]
encoder_model_type: hf_bert

# HuggingFace's config name for model initialization
pretrained_model_cfg: bert-base-uncased

# Some encoders need to be initialized from a file
pretrained_file:

# Extra linear layer on top of standard bert/roberta encoder
projection_dim: 0

# Max length of the encoder input sequence
sequence_length: 256

dropout: 0.1

# whether to fix (i.e., not update) the context encoder during training
fix_ctx_encoder: False

# if False, the model won't load pre-trained BERT weights
pretrained: True