48 changes: 48 additions & 0 deletions research/information_retrieval/DPR/README.md
@@ -0,0 +1,48 @@
# Compressing DPR
Author: @spacemanidol

## Methods
1. Varying models
2. Structured Pruning
3. Unstructured Pruning
4. Dimensionality Reduction
## Usage
Bi-encoder training hyperparameters:

```yaml
batch_size: 4
dev_batch_size: 16
adam_eps: 1e-8
adam_betas: (0.9, 0.999)
max_grad_norm: 2.0
log_batch_step: 1
train_rolling_loss_step: 100
weight_decay: 0.0
learning_rate: 2e-5

# Linear warmup over warmup_steps.
warmup_steps: 1237

# Number of update steps to accumulate before performing a backward/update pass.
gradient_accumulation_steps: 1

# Total number of training epochs to perform.
num_train_epochs: 40
eval_per_epoch: 1
hard_negatives: 1
other_negatives: 0
val_av_rank_hard_neg: 30
val_av_rank_other_neg: 30
val_av_rank_bsz: 128
val_av_rank_max_qs: 10000
```
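A hypothetical training invocation with these settings could look like the sketch below. The entry point (train_dense_encoder.py) and the nq_train/nq_dev dataset names come from the Hydra config groups described in conf/README.md; the output path is an illustrative placeholder.

```bash
# Sketch: fine-tune the DPR bi-encoder on NQ, overriding config values from the command line.
# nq_train and nq_dev are dataset names from the encoder_train_default datasets group;
# output_dir is an illustrative placeholder.
python train_dense_encoder.py \
  train_datasets=[nq_train] \
  dev_datasets=[nq_dev] \
  output_dir=outputs/dpr_nq_biencoder
```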

Data downloads:
- https://www.dropbox.com/s/lvvpsx0cjk4vemv/collection.tar.gz?dl=1
- https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv?dl=1
- https://www.dropbox.com/s/khsplt2fhqwjs0v/qrels.dev.small.tsv?dl=1
- https://www.dropbox.com/s/uzkvv4gpj3a596a/predicted_queries_topk_sampling.zip?dl=1
- https://www.dropbox.com/s/nc1drdkjpxxsngg/run.dev.small.tsv?dl=1
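The files above can be fetched directly, for example with wget (the output filenames below simply mirror the originals):

```bash
# Sketch: download the data listed above and unpack the archives.
wget -O collection.tar.gz "https://www.dropbox.com/s/lvvpsx0cjk4vemv/collection.tar.gz?dl=1"
wget -O queries.dev.small.tsv "https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv?dl=1"
wget -O qrels.dev.small.tsv "https://www.dropbox.com/s/khsplt2fhqwjs0v/qrels.dev.small.tsv?dl=1"
wget -O predicted_queries_topk_sampling.zip "https://www.dropbox.com/s/uzkvv4gpj3a596a/predicted_queries_topk_sampling.zip?dl=1"
wget -O run.dev.small.tsv "https://www.dropbox.com/s/nc1drdkjpxxsngg/run.dev.small.tsv?dl=1"
tar -xzvf collection.tar.gz
unzip predicted_queries_topk_sampling.zip
```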
## Results

| Top-k passages | Original DPR NQ model | New DPR model |
| ------------- |:-------------:| -----:|
| 1 | 45.87 | 52.47 |
| 5 | 68.14 | 72.24 |
| 20 | 79.97 | 81.33 |
| 100 | 85.87 | 87.29 |
### requirements.txt
65 changes: 65 additions & 0 deletions research/information_retrieval/DPR/conf/README.md
@@ -0,0 +1,65 @@
## Hydra

[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python
framework that simplifies the development of research and other complex
applications. The key feature is the ability to dynamically create a
hierarchical configuration by composition and override it through config files
and the command line.

## DPR configuration
All DPR tools' configuration parameters are now split between different config groups, and you can either modify them in the config files or override them from the command line.

Each tool's main method (in train_dense_encoder.py, generate_dense_embeddings.py, dense_retriever.py and train_reader.py) now has a Hydra @hydra.main decorator that names its configuration file in the conf/ dir.
For example, dense_retriever.py takes all its parameters from the conf/dense_retriever.yaml file.
Every tool's configuration file refers to other configuration files via the "defaults:" parameter.
This is called a [configuration group](https://hydra.cc/docs/tutorials/structured_config/config_groups) in Hydra.

Let's take a look at dense_retriever.py's configuration:


```yaml

defaults:
- encoder: hf_bert
- datasets: retriever_default
- ctx_sources: default_sources

indexers:
flat:
_target_: dpr.indexer.faiss_indexers.DenseFlatIndexer

hnsw:
_target_: dpr.indexer.faiss_indexers.DenseHNSWFlatIndexer

hnsw_sq:
_target_: dpr.indexer.faiss_indexers.DenseHNSWSQIndexer

...
qa_dataset:
...
ctx_datatsets:
...
indexer: flat
...

```

" - encoder: " - a configuration group that contains all parameters to instantiate the encoder. The actual parameters are located in conf/encoder/hf_bert.yaml file.
If you want to override some of them, you can either
- Modify that config file
- Create a new config group file under conf/encoder/ folder and enable to use it by providing encoder={your file name} command line argument
- Override specific parameter from command line. For example: encoder.sequence_length=300
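A minimal sketch of the last two options (conf/encoder/my_encoder.yaml is a hypothetical file you would create yourself; other required dense_retriever.py parameters are omitted for brevity):

```bash
# Use a custom encoder config group file: conf/encoder/my_encoder.yaml (hypothetical).
python dense_retriever.py encoder=my_encoder

# Override a single encoder parameter in place.
python dense_retriever.py encoder.sequence_length=300
```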

" - datasets:" - a configuration group that contains a list of all possible sources of queries for evaluation. One can find them in conf/datasets/retriever_default.yaml file.
One should specify the dataset to use by providing qa_dataset parameter in order to use one of them during evaluation. For example, if you want to run the retriever on NQ test set, set qa_dataset=nq_test as a command line parameter.

It is now much easier to use custom datasets, without the need to convert them to the DPR format. Just define your own class that provides the relevant __getitem__(), __len__() and load_data() methods (inheriting from QASrc).

" - ctx_sources: " - a configuration group that contains a list of all possible passage sources. One can find them in conf/ctx_sources/default_sources.yaml file.
One should specify a list of names of the passages datasets as ctx_datatsets parameter. For example, if you want to use dpr's old wikipedia passages, set ctx_datatsets=[dpr_wiki].
Please note that this parameter is a list and you can effectively concatenate different passage source into one. In order to use multiple sources at once, one also needs to provide relevant embeddings files in encoded_ctx_files parameter, which is also a list.


"indexers:" - a parameters map that defines various indexes. The actual index is selected by indexer parameter which is 'flat' by default but you can use loss index types by setting indexer=hnsw or indexer=hnsw_sq in the command line.

Please refer to the comments in the configuration files for details on every parameter.
47 changes: 47 additions & 0 deletions research/information_retrieval/DPR/conf/biencoder_train_cfg.yaml
@@ -0,0 +1,47 @@

# configuration groups
defaults:
- encoder: hf_bert
- train: biencoder_default
- datasets: encoder_train_default

train_datasets:
dev_datasets:
output_dir:
train_sampling_rates:
loss_scale_factors:

# Whether to lower case the input text. Set True for uncased models, False for the cased ones.
do_lower_case: True

fix_ctx_encoder: False
val_av_rank_start_epoch: 30
seed: 12345
checkpoint_file_name: dpr_biencoder

# A trained bi-encoder checkpoint file to initialize the model
model_file:

# TODO: move to a conf group
# local_rank for distributed training on gpus
local_rank: -1
global_loss_buf_sz: 592000
device:
distributed_world_size:
distributed_port:
no_cuda: False
n_gpu:
fp16: True

# For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', 'O3'].
# See details at https://nvidia.github.io/apex/amp.html
fp16_opt_level: O1

# tokens which won't be split by the tokenizer
special_tokens:

ignore_checkpoint_offset: False
ignore_checkpoint_optimizer: False

# set to True to enable multiple query encoders
multi_q_encoder: False
@@ -0,0 +1,6 @@
# @package _group_

dpr_wiki:
_target_: dpr.data.retriever_data.CsvCtxSrc
file: data.wikipedia_split.psgs_w100
id_prefix: 'wiki:'
@@ -0,0 +1,46 @@
# @package _group_

nq_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.nq-train

nq_train_hn1:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.nq-adv-hn-train

nq_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.nq-dev

trivia_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.trivia-train

trivia_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.trivia-dev

squad1_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.squad1-train

squad1_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.squad1-dev

webq_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.webq-train

webq_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.webq-dev

curatedtrec_train:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.curatedtrec-train

curatedtrec_dev:
_target_: dpr.data.biencoder_data.JsonQADataset
file: data.retriever.curatedtrec-dev

@@ -0,0 +1,33 @@
# @package _group_

nq_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.nq-test

nq_train:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.nq-train

nq_dev:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.nq-dev

trivia_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.trivia-test

trivia_train:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.trivia-train

trivia_dev:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.trivia-dev

webq_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.webq-test

curatedtrec_test:
_target_: dpr.data.retriever_data.CsvQASrc
file: data.retriever.qas.curatedtrec-test
71 changes: 71 additions & 0 deletions research/information_retrieval/DPR/conf/dense_retriever.yaml
@@ -0,0 +1,71 @@
defaults:
- encoder: hf_bert # defines encoder initialization parameters
- datasets: retriever_default # contains a list of all possible sources of queries for evaluation. Specific set is selected by qa_dataset parameter
- ctx_sources: default_sources # contains a list of all possible passage sources. Specific passages sources selected by ctx_datatsets parameter

indexers:
flat:
_target_: dpr.indexer.faiss_indexers.DenseFlatIndexer

hnsw:
_target_: dpr.indexer.faiss_indexers.DenseHNSWFlatIndexer

hnsw_sq:
_target_: dpr.indexer.faiss_indexers.DenseHNSWSQIndexer

# the name of the queries dataset from the 'datasets' config group
qa_dataset:

# a list of names of the passages datasets from the 'ctx_sources' config group
ctx_datatsets:

# Glob paths to encoded passages (from the generate_dense_embeddings tool)
encoded_ctx_files: []

out_file:
# "regex" or "string"
match: string
n_docs: 100
validation_workers: 16

# Batch size to generate query embeddings
batch_size: 128

# Whether to lower case the input text. Set True for uncased models, False for the cased ones.
do_lower_case: True

# The attribute name of encoder to use for queries. Options for the BiEncoder model: question_model, ctx_model
# question_model is used if this param is empty
encoder_path:

# path to the FAISS index location - only needed if you want to serialize the faiss index to files or read it from them
# (instead of using encoded_ctx_files)
# it should point to either a directory or a common index file prefix name
# if there is no index at the specified location, the index will be created from encoded_ctx_files
index_path:

kilt_out_file:

# A trained bi-encoder checkpoint file to initialize the model
model_file:

validate_as_tables: False
rpc_retriever_cfg_file:
indexer: flat

# tokens which won't be split by the tokenizer
special_tokens:

# TODO: move to a conf group
# local_rank for distributed training on gpus
local_rank: -1
global_loss_buf_sz: 150000
device:
distributed_world_size:
no_cuda: False
n_gpu:
fp16: False

# For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', 'O3'].
# See details at https://nvidia.github.io/apex/amp.html
fp16_opt_level: O1
24 changes: 24 additions & 0 deletions research/information_retrieval/DPR/conf/encoder/hf_bert.yaml
@@ -0,0 +1,24 @@
# @package _group_

# model type. One of [hf_bert, pytext_bert, fairseq_roberta]
encoder_model_type: hf_bert

# HuggingFace's config name for model initialization
pretrained_model_cfg: bert-base-uncased

# Some encoders need to be initialized from a file
pretrained_file:

# Extra linear layer on top of standard bert/roberta encoder
projection_dim: 0

# Max length of the encoder input sequence
sequence_length: 256

dropout: 0.1

# whether to fix (i.e., not update) the context encoder during training
fix_ctx_encoder: False

# if False, the model won't load pre-trained BERT weights
pretrained: True