Reproduction of DensePhrase (w/ PQ, w/o qft) on SQuAD #22

Closed
alexlimh opened this issue Nov 22, 2021 · 9 comments

@alexlimh

I've built the compressed DensePhrases index on SQuAD using OPQ96. I haven't run any query-side fine-tuning yet, but here are the results:


11/22/2021 19:50:57 - INFO - main - no_ans/all: 0, 10570
11/22/2021 19:50:57 - INFO - main - Evaluating 10570 answers
11/22/2021 19:50:58 - INFO - main - EM: 21.63, F1: 27.96
11/22/2021 19:50:58 - INFO - main - 1) Which NFL team represented the AFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], top 5 prediction: ['Denver Broncos', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers']
11/22/2021 19:50:58 - INFO - main - 2) Which NFL team represented the NFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers'], top 5 prediction: ['San Francisco 49ers', 'Chicago Bears', 'Seattle Seahawks', 'Tampa Bay Buccaneers', 'Green Bay Packers']
11/22/2021 19:50:58 - INFO - main - 3) Where did Super Bowl 50 take place
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], top 5 prediction: ['Tacoma, Washington, USA', "Levi's Stadium in Santa Clara, California", 'DeVault Vineyards in Concord, Virginia', "Levi's Stadium in Santa Clara", 'Jinan Olympic Sports Center Gymnasium in Jinan, China']
11/22/2021 19:53:44 - INFO - main - {'exact_match_top1': 21.62724692526017, 'f1_score_top1': 27.958255585698414}
11/22/2021 19:53:44 - INFO - main - {'exact_match_top200': 57.48344370860927, 'f1_score_top200': 73.28679644685603}
11/22/2021 19:53:44 - INFO - main - {'redundancy of top200': 5.308987701040681}
11/22/2021 19:53:44 - INFO - main - Saving prediction file to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200.pred
10570it [00:23, 448.84it/s]
11/22/2021 19:54:58 - INFO - main - avg psg len=124.84 for 10570 preds
11/22/2021 19:54:58 - INFO - main - dump to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200_psg-top100.json
ctx token length: 124.84
unique titles: 98.20

Top-1 = 27.02%
Top-5 = 42.80%
Top-20 = 56.40%
Top-100 = 69.20%
Acc@1 when Acc@100 = 39.05%
MRR@20 = 34.30
P@20 = 8.94


I understand that index compression results in accuracy loss without query-side fine-tuning. However, the score still looks a little too low to me. Could @jhyuklee confirm whether this looks alright?

@jhyuklee
Member

Hi, is this using the entire Wikipedia for the phrase dump, or just the SQuAD development set passages?
If it's just the SQuAD development set passages, this is pretty low. I think I got over 60 EM for SQuAD; even densephrases-multi should get at least 50 EM.

@alexlimh
Author

alexlimh commented Nov 22, 2021

Hi, is this using the entire Wikipedia for the phrase dump, or just the SQuAD development set passages?

This is using the entire Wikipedia dump and tested on sqd-open-qa.

@jhyuklee
Member

If this is using the entire Wikipedia dump, then I think this is a good start. You'll be able to reach 35~40 EM after query-side fine-tuning. Make sure you set a larger max_answer_length for SQuAD because it does have longer answers.
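
(For reference, query-side fine-tuning is run via the train-query target; a rough sketch only, since the output model name below is a placeholder and the exact target arguments may differ across versions of the Makefile:)

make train-query MODEL_NAME=densephrases-squad-ddp-query \
	DUMP_DIR=$SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump/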

@jhyuklee
Member

I'm not sure which model you used for generating the phrase vecs, but it seems slightly low compared to what is uploaded on GitHub (densephrases-multi scores 29 EM before query-side fine-tuning on SQuAD).

@alexlimh
Author

Thanks! That's really helpful.

Make sure you set a larger max_answer_length for SQuAD because it does have longer answers.

Is there a default max_answer_length for SQuAD that you've been using?

I'm not sure which model you used for generating the phrase vecs

The model I used is densephrases-squad-ddp, which was trained using the following command:

make run-rc-sqd-ddp MODEL_NAME=densephrases-squad-ddp
run-rc-sqd-ddp: model-name sqd-rc-data sqd-param pbn-param medium1-index
	make train-rc-ddp \
		TRAIN_DATA=$(TRAIN_QG_DATA) DEV_DATA=$(DEV_DATA) \
		TEACHER_NAME=$(TEACHER_NAME) MODEL_NAME=$(MODEL_NAME)_tmp \
		BS=$(BS) LR=$(LR) MAX_SEQ_LEN=$(MAX_SEQ_LEN) \
		LAMBDA_KL=$(LAMBDA_KL) LAMBDA_NEG=$(LAMBDA_NEG)
	make train-rc-ddp \
		TRAIN_DATA=$(TRAIN_DATA) DEV_DATA=$(DEV_DATA) \
		TEACHER_NAME=$(TEACHER_NAME) MODEL_NAME=$(MODEL_NAME) \
		BS=$(BS) LR=$(LR) MAX_SEQ_LEN=$(MAX_SEQ_LEN) \
		LAMBDA_KL=$(LAMBDA_KL) LAMBDA_NEG=$(LAMBDA_NEG) \
		OPTIONS='$(PBN_OPTIONS) --load_dir $(SAVE_DIR)/$(MODEL_NAME)_tmp'
train-rc-ddp: model-name sqd-rc-data sqd-param
	OMP_NUM_THREADS=20 python -m torch.distributed.launch \
		--nnode=1 --node_rank=0 --nproc_per_node=4 train_rc.py \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--data_dir $(DATA_DIR)/single-qa \
		--cache_dir $(CACHE_DIR) \
		--train_file $(TRAIN_DATA) \
		--predict_file $(DEV_DATA) \
		--do_train \
		--do_eval \
		--fp16 \
		--per_gpu_train_batch_size $(BS) \
		--learning_rate $(LR) \
		--num_train_epochs 2.0 \
		--max_seq_length $(MAX_SEQ_LEN) \
		--lambda_kl $(LAMBDA_KL) \
		--lambda_neg $(LAMBDA_NEG) \
		--lambda_flt 1.0 \
		--filter_threshold -2.0 \
		--append_title \
		--evaluate_during_training \
		--teacher_dir $(SAVE_DIR)/$(TEACHER_NAME) \
		--output_dir $(SAVE_DIR)/$(MODEL_NAME) \
		$(OPTIONS)

Generating Vecs (in parallel):

make gen-vecs-parallel MODEL_NAME=densephrases-squad-ddp START=$start END=$end
gen-vecs-parallel: model-name
	python scripts/parallel/dump_phrases.py \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--cache_dir $(CACHE_DIR) \
		--data_dir $(DATA_DIR)/wikidump \
		--data_name wiki-20181220 \
		--load_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--output_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--filter_threshold 1.0 \
		--append_title \
		--start $(START) \
		--end $(END) \
		--num_gpus 4
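
(The START/END arguments shard the Wikipedia dump across jobs; dispatching might look something like the sketch below, where the shard boundaries are made-up placeholders and should be set to cover the actual number of dump files:)

for start in 0 1000 2000 3000 4000; do
	end=$((start + 1000))
	make gen-vecs-parallel MODEL_NAME=densephrases-squad-ddp START=$start END=$end
done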

Building Index:

make index-vecs DUMP_DIR=$SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump/ NUM_CLUSTERS=1048576 INDEX_TYPE=OPQ96
index-vecs: dump-dir large-index
	python build_phrase_index.py \
		--dump_dir $(DUMP_DIR) \
		--stage all \
		--replace \
		--num_clusters $(NUM_CLUSTERS) \
		--fine_quant $(INDEX_TYPE) \
		--cuda
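
(As a sanity check, the NUM_CLUSTERS/INDEX_TYPE pair chosen here determines the index directory that --index_name points at during evaluation; assuming the standard dump layout, it should show up as:)

ls $SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump/start/1048576_flat_OPQ96/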

Compressing meta:

make compress-meta DUMP_DIR=$SAVE_DIR/densephrases-squad-ddp_wiki-20181220/dump
compress-meta:
	python scripts/preprocess/compress_metadata.py \
		--input_dump_dir $(DUMP_DIR)/phrase \
		--output_dir $(DUMP_DIR)

Evaluating index:

make eval-index-psg-sqd MODEL_NAME=densephrases-squad-ddp DUMP_DIR=outputs/densephrases-squad-ddp_wiki-20181220/dump/
eval-index-psg-sqd: dump-dir model-name large-index sqd-open-data
	python eval_phrase_retrieval.py \
		--run_mode eval \
		--model_type bert \
		--pretrained_name_or_path SpanBERT/spanbert-base-cased \
		--cuda \
		--dump_dir $(DUMP_DIR) \
		--index_name start/$(NUM_CLUSTERS)_flat_$(INDEX_TYPE) \
		--load_dir $(SAVE_DIR)/$(MODEL_NAME) \
		--test_path $(DATA_DIR)/$(TEST_DATA) \
		--save_pred \
		--aggregate \
		--agg_strat opt2 \
		--top_k 200 \
		--eval_psg \
		--psg_top_k 100 \
		$(OPTIONS)

@jhyuklee
Member

Looks like you did a great job on the entire process (though I'm not sure why you chose medium1-index for run-rc-sqd-ddp instead of small-index; the SQuAD dev set has only about 2k passages, so small-index should be fine).

First, I have to mention that the current hyperparameters (sqd-param) are not DDP-friendly; they are tuned for a single 24GB GPU. If you want to use DDP for training, I strongly suggest you change the hyperparameters (larger batch sizes might require larger learning rates) and keep track of the accuracy (i.e., semi-open-domain accuracy on SQuAD passages) right after run-rc-sqd-ddp, which correlates strongly with the final open-domain QA accuracy.

Second, for SQuAD, the maximum sequence length matters more than the batch size in my experience. As for max_answer_length, it is currently set to 10, which is for NQ (answers there are at most 5 words), but you can set it to 20 for SQuAD.
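
(Since the eval-index-psg-sqd target above forwards $(OPTIONS) to eval_phrase_retrieval.py, one way to try this, assuming the script accepts a --max_answer_length flag like train_rc.py does, is:)

make eval-index-psg-sqd MODEL_NAME=densephrases-squad-ddp \
	DUMP_DIR=outputs/densephrases-squad-ddp_wiki-20181220/dump/ \
	OPTIONS='--max_answer_length 20'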

@alexlimh
Author

I see. But looking at the training log, it seems DensePhrases is working fine on the dev set:

OMP_NUM_THREADS=20 python -m torch.distributed.launch \
	--nnode=1 --node_rank=0 --nproc_per_node=4 train_rc.py \
	--model_type bert \
	--pretrained_name_or_path SpanBERT/spanbert-base-cased \
	--data_dir .//densephrases-data/single-qa \
	--cache_dir .//cache \
	--train_file squad/train-v1.1_qg_ents_t5large_3500_filtered.json \
	--predict_file squad/dev-v1.1.json \
	--do_train \
	--do_eval \
	--fp16 \
	--per_gpu_train_batch_size 24 \
	--learning_rate 3e-5 \
	--num_train_epochs 2.0 \
	--max_seq_length 384 \
	--lambda_kl 4.0 \
	--lambda_neg 2.0 \
	--lambda_flt 1.0 \
	--filter_threshold -2.0 \
	--append_title \
	--evaluate_during_training \
	--teacher_dir .//outputs/spanbert-base-cased-squad \
	--output_dir .//outputs/densephrases-squad-ddp_tmp \

...

Evaluating: 100%|█████████▉| 904/905 [00:53<00:00, 16.45it/s]
10570it [00:53, 196.62it/s]
11/17/2021 00:23:52 - INFO - densephrases.utils.squad_metrics -   saved vecs=1104943/15389478, save rate=0.0718
11/17/2021 00:23:52 - INFO - densephrases.utils.squad_metrics -   answer recall=0.0000

Evaluating: 100%|█████████▉| 904/905 [00:55<00:00, 16.34it/s]
11/17/2021 00:23:55 - INFO - __main__ -   Evaluation done in total 56.362774 secs (0.005194 sec per example)
11/17/2021 00:23:55 - INFO - __main__ -   Results: {'exact_final': 75.49668874172185, 'f1_final': 84.58729121922526, 'total_final': 10570, 'HasAns_exact_final': 75.49668874172185, 'HasAns_f1_final': 84.58729121922526, 'HasAns_total_final': 10570, 'best_exact_final': 75.49668874172185, 'best_exact_thresh_final': 0.0, 'best_f1_final': 84.58729121922526, 'best_f1_thresh_final': 0.0}

@jhyuklee
Member

Oh, I think you are missing this part, where you can evaluate your model in the semi-open-domain setup (using all development set passages). This is a better approximation of open-domain QA.

DensePhrases/Makefile

Lines 219 to 230 in b52fe06

	make gen-vecs \
		DEV_DATA=$(DEV_DATA) MODEL_NAME=$(MODEL_NAME)
	make index-vecs \
		DUMP_DIR=$(SAVE_DIR)/$(MODEL_NAME)/dump \
		NUM_CLUSTERS=$(NUM_CLUSTERS) INDEX_TYPE=$(INDEX_TYPE)
	make compress-meta \
		DUMP_DIR=$(SAVE_DIR)/$(MODEL_NAME)/dump
	make eval-index \
		DUMP_DIR=$(SAVE_DIR)/$(MODEL_NAME)/dump \
		NUM_CLUSTERS=$(NUM_CLUSTERS) INDEX_TYPE=$(INDEX_TYPE) \
		MODEL_NAME=$(MODEL_NAME) TEST_DATA=$(SOD_DATA) \
		OPTIONS=$(OPTIONS)

@alexlimh
Author

Got it, I will try the semi-open-domain evaluation and see if it works.
