Reproduction of DensePhrase (w/ PQ, w/o qft) on SQuAD #22

Closed
alexlimh opened this issue Nov 22, 2021 · 9 comments

@alexlimh

I've built the compressed DensePhrases index on SQuAD using OPQ96. I haven't run any query-side fine-tuning yet, but here are the results:


11/22/2021 19:50:57 - INFO - main - no_ans/all: 0, 10570
11/22/2021 19:50:57 - INFO - main - Evaluating 10570 answers
11/22/2021 19:50:58 - INFO - main - EM: 21.63, F1: 27.96
11/22/2021 19:50:58 - INFO - main - 1) Which NFL team represented the AFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], top 5 prediction: ['Denver Broncos', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers']
11/22/2021 19:50:58 - INFO - main - 2) Which NFL team represented the NFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers'], top 5 prediction: ['San Francisco 49ers', 'Chicago Bears', 'Seattle Seahawks', 'Tampa Bay Buccaneers', 'Green Bay Packers']
11/22/2021 19:50:58 - INFO - main - 3) Where did Super Bowl 50 take place
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], top 5 prediction: ['Tacoma, Washington, USA', "Levi's Stadium in Santa Clara, California", 'DeVault Vineyards in Concord, Virginia', "Levi's Stadium in Santa Clara", 'Jinan Olympic Sports Center Gymnasium in Jinan, China']
11/22/2021 19:53:44 - INFO - main - {'exact_match_top1': 21.62724692526017, 'f1_score_top1': 27.958255585698414}
11/22/2021 19:53:44 - INFO - main - {'exact_match_top200': 57.48344370860927, 'f1_score_top200': 73.28679644685603}
11/22/2021 19:53:44 - INFO - main - {'redundancy of top200': 5.308987701040681}
11/22/2021 19:53:44 - INFO - main - Saving prediction file to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200.pred
10570it [00:23, 448.84it/s]
11/22/2021 19:54:58 - INFO - main - avg psg len=124.84 for 10570 preds
11/22/2021 19:54:58 - INFO - main - dump to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200_psg-top100.json
ctx token length: 124.84
unique titles: 98.20

Top-1 = 27.02%
Top-5 = 42.80%
Top-20 = 56.40%
Top-100 = 69.20%
Acc@1 when Acc@100 = 39.05%
MRR@20 = 34.30
P@20 = 8.94


I understand that index compression results in accuracy loss without query-side fine-tuning. However, the score still looks a little too low to me. Could @jhyuklee confirm whether this looks alright?

@jhyuklee
Member

Hi, is this using the entire Wikipedia for the phrase dump, or just the SQuAD development set passages?
If it's just the SQuAD development set passages, this is pretty low. I think I got over 60 EM for SQuAD; even densephrases-multi should get at least 50 EM.

@alexlimh
Author

alexlimh commented Nov 22, 2021

Hi, is this using the entire Wikipedia for the phrase dump, or just the SQuAD development set passages?

This is using the entire Wikipedia dump and tested on sqd-open-qa.

@jhyuklee
Member

If this is using the entire Wikipedia dump, then I think this is a good start. You'll be able to reach 35~40 EM after query-side fine-tuning. Make sure you set a larger max_answer_length for SQuAD because it does have longer answers.
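
(For reference, query-side fine-tuning is run via the train-query target; a rough sketch only, since the output model name below is a placeholder and the exact target arguments may differ across versions of the Makefile:)

make train-query MODEL_NAME=densephrases-squad-ddp-query \
	DUMP_DIR=$SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump/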

@jhyuklee
Member

I'm not sure which model you used for generating the phrase vecs, but it seems slightly low compared to what is uploaded on GitHub (densephrases-multi scores 29 EM before query-side fine-tuning on SQuAD).

@alexlimh
Author

Thanks! That's really helpful.

Make sure you set a larger max_answer_length for SQuAD because it does have longer answers.

Is there a default max_answer_length for SQuAD that you've been using?

I'm not sure which model you used for generating the phrase vecs

The model I used is densephrases-squad-ddp, which was trained using the following command:

make run-rc-sqd-ddp MODEL_NAME=densephrases-squad-ddp
run-rc-sqd-ddp: model-name sqd-rc-data sqd-param pbn-param medium1-index
	make train-rc-ddp \
		TRAIN_DATA=$(TRAIN_QG_DATA) DEV_DATA=$(DEV_DATA) \
		TEACHER_NAME=$(TEACHER_NAME) MODEL_NAME=$(MODEL_NAME)_tmp \
		BS=$(BS) LR=$(LR) MAX_SEQ_LEN=$(MAX_SEQ_LEN) \
		LAMBDA_KL=$(LAMBDA_KL) LAMBDA_NEG=$(LAMBDA_NEG)
	make train-rc-ddp \
		TRAIN_DATA=$(TRAIN_DATA) DEV_DATA=$(DEV_DATA) \
		TEACHER_NAME=$(TEACHER_NAME) MODEL_NAME=$(MODEL_NAME) \
		BS=$(BS) LR=$(LR) MAX_SEQ_LEN=$(MAX_SEQ_LEN) \
		LAMBDA_KL=$(LAMBDA_KL) LAMBDA_NEG=$(LAMBDA_NEG) \
		OPTIONS='$(PBN_OPTIONS) --load_dir $(SAVE_DIR)/$(MODEL_NAME)_tmp'
train-rc-ddp: model-name sqd-rc-data sqd-param
	OMP_NUM_THREADS=20 python -m torch.distributed.launch \
		--nnode=1 --node_rank=0 --nproc_per_node=4 train_rc.py \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--data_dir $(DATA_DIR)/single-qa \
		--cache_dir $(CACHE_DIR) \
		--train_file $(TRAIN_DATA) \
		--predict_file $(DEV_DATA) \
		--do_train \
		--do_eval \
		--fp16 \
		--per_gpu_train_batch_size $(BS) \
		--learning_rate $(LR) \
		--num_train_epochs 2.0 \
		--max_seq_length $(MAX_SEQ_LEN) \
		--lambda_kl $(LAMBDA_KL) \
		--lambda_neg $(LAMBDA_NEG) \
		--lambda_flt 1.0 \
		--filter_threshold -2.0 \
		--append_title \
		--evaluate_during_training \
		--teacher_dir $(SAVE_DIR)/$(TEACHER_NAME) \
		--output_dir $(SAVE_DIR)/$(MODEL_NAME) \
		$(OPTIONS)

Generating Vecs (in parallel):

make gen-vecs-parallel MODEL_NAME=densephrases-squad-ddp START=$start END=$end
gen-vecs-parallel: model-name
	python scripts/parallel/dump_phrases.py \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--cache_dir $(CACHE_DIR) \
		--data_dir $(DATA_DIR)/wikidump \
		--data_name wiki-20181220 \
		--load_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--output_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--filter_threshold 1.0 \
		--append_title \
		--start $(START) \
		--end $(END) \
		--num_gpus 4
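
(The START/END arguments shard the Wikipedia dump across jobs; dispatching might look something like the sketch below, where the shard boundaries are made-up placeholders and should be set to cover the actual number of dump files:)

for start in 0 1000 2000 3000 4000; do
	end=$((start + 1000))
	make gen-vecs-parallel MODEL_NAME=densephrases-squad-ddp START=$start END=$end
done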

Building Index:

make index-vecs DUMP_DIR=$SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump/ NUM_CLUSTERS=1048576 INDEX_TYPE=OPQ96
index-vecs: dump-dir large-index
	python build_phrase_index.py \
		--dump_dir $(DUMP_DIR) \
		--stage all \
		--replace \
		--num_clusters $(NUM_CLUSTERS) \
		--fine_quant $(INDEX_TYPE) \
		--cuda
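
(As a sanity check, the NUM_CLUSTERS/INDEX_TYPE pair chosen here determines the index directory that --index_name points at during evaluation; assuming the standard dump layout, it should show up as:)

ls $SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump/start/1048576_flat_OPQ96/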

Compressing meta:

make compress-meta DUMP_DIR=$SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump
compress-meta:
	python scripts/preprocess/compress_metadata.py \
		--input_dump_dir $(DUMP_DIR)/phrase \
		--output_dir $(DUMP_DIR)

Evaluating index:

make eval-index-psg-sqd MODEL_NAME=densephrases-squad-ddp DUMP_DIR=outputs/densephrases-squad-ddp_wiki-20181220/dump/
eval-index-psg-sqd: dump-dir model-name large-index sqd-open-data
	python eval_phrase_retrieval.py \
		--run_mode eval \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--cuda \
		--dump_dir $(DUMP_DIR) \
		--index_name start/$(NUM_CLUSTERS)_flat_$(INDEX_TYPE) \
		--load_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--test_path $(DATA_DIR)/$(TEST_DATA) \
		--save_pred \
		--aggregate \
		--agg_strat opt2 \
		--top_k 200 \
		--eval_psg \
		--psg_top_k 100 \
		$(OPTIONS)

@jhyuklee
Member

Looks like you did a great job on the entire process (though I'm not sure why you chose medium1-index for run-rc-sqd-ddp instead of small-index; the SQuAD dev set has only about 2k passages, so small-index should be fine).

First, I have to mention that the current hyperparameters (sqd-param) are not DDP-friendly; they are tuned for a single 24GB GPU. If you want to use DDP for training, I strongly suggest you change the hyperparameters (larger batch sizes might require larger learning rates) and keep track of the accuracy (i.e., semi-open-domain accuracy on SQuAD passages) right after run-rc-sqd-ddp, which correlates strongly with the final open-domain QA accuracy.

Second, for SQuAD, the maximum sequence length matters more than the batch size in my experience. As for max_answer_length, it is currently set to 10, which is for NQ (answers there are at most 5 words), but you can set it to 20 for SQuAD.
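
(Since the eval-index-psg-sqd target above forwards $(OPTIONS) to eval_phrase_retrieval.py, one way to try this, assuming the script accepts a --max_answer_length flag like train_rc.py does, is:)

make eval-index-psg-sqd MODEL_NAME=densephrases-squad-ddp \
	DUMP_DIR=outputs/densephrases-squad-ddp_wiki-20181220/dump/ \
	OPTIONS='--max_answer_length 20'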

@alexlimh
Author

I see. But looking at the training log, it seems DensePhrases is working fine on the dev set:

OMP_NUM_THREADS=20 python -m torch.distributed.launch \
	--nnode=1 --node_rank=0 --nproc_per_node=4 train_rc.py \
	--model_type bert \
	--pretrained_name_or_path SpanBERT/spanbert-base-cased \
	--data_dir .//densephrases-data/single-qa \
	--cache_dir .//cache \
	--train_file squad/train-v1.1_qg_ents_t5large_3500_filtered.json \
	--predict_file squad/dev-v1.1.json \
	--do_train \
	--do_eval \
	--fp16 \
	--per_gpu_train_batch_size 24 \
	--learning_rate 3e-5 \
	--num_train_epochs 2.0 \
	--max_seq_length 384 \
	--lambda_kl 4.0 \
	--lambda_neg 2.0 \
	--lambda_flt 1.0 \
	--filter_threshold -2.0 \
	--append_title \
	--evaluate_during_training \
	--teacher_dir .//outputs/spanbert-base-cased-squad \
	--output_dir .//outputs/densephrases-squad-ddp_tmp \

...

Evaluating: 100%|█████████▉| 904/905 [00:53<00:00, 16.45it/s]
10570it [00:53, 196.62it/s]
11/17/2021 00:23:52 - INFO - densephrases.utils.squad_metrics -   saved vecs=1104943/15389478, save rate=0.0718
11/17/2021 00:23:52 - INFO - densephrases.utils.squad_metrics -   answer recall=0.0000

Evaluating: 100%|█████████▉| 904/905 [00:55<00:00, 16.34it/s]
11/17/2021 00:23:55 - INFO - __main__ -   Evaluation done in total 56.362774 secs (0.005194 sec per example)
11/17/2021 00:23:55 - INFO - __main__ -   Results: {'exact_final': 75.49668874172185, 'f1_final': 84.58729121922526, 'total_final': 10570, 'HasAns_exact_final': 75.49668874172185, 'HasAns_f1_final': 84.58729121922526, 'HasAns_total_final': 10570, 'best_exact_final': 75.49668874172185, 'best_exact_thresh_final': 0.0, 'best_f1_final': 84.58729121922526, 'best_f1_thresh_final': 0.0}

@jhyuklee
Member

Oh, I think you are missing this part, where you can evaluate your model in the semi-open-domain setup (using all development set passages). This is a better approximation of open-domain QA.

DensePhrases/Makefile

Lines 219 to 230 in b52fe06

	make gen-vecs \
		DEV_DATA=$(DEV_DATA) MODEL_NAME=$(MODEL_NAME)
	make index-vecs \
		DUMP_DIR=$(SAVE_DIR)/$(MODEL_NAME)/dump \
		NUM_CLUSTERS=$(NUM_CLUSTERS) INDEX_TYPE=$(INDEX_TYPE)
	make compress-meta \
		DUMP_DIR=$(SAVE_DIR)/$(MODEL_NAME)/dump
	make eval-index \
		DUMP_DIR=$(SAVE_DIR)/$(MODEL_NAME)/dump \
		NUM_CLUSTERS=$(NUM_CLUSTERS) INDEX_TYPE=$(INDEX_TYPE) \
		MODEL_NAME=$(MODEL_NAME) TEST_DATA=$(SOD_DATA) \
		OPTIONS=$(OPTIONS)

@alexlimh
Author

Got it, I will try the semi-open-domain evaluation and see if it works.
