Reproduction of DensePhrases (w/ PQ, w/o qft) on SQuAD #22
Comments
Hi, is this using the entire Wikipedia for the phrase dump, or just the SQuAD development set passages? |
This is using the entire Wikipedia dump and tested on sqd-open-qa. |
If this is using the entire Wikipedia dump, then I think this is a good start. You'll be able to reach 35~40 EM after query-side fine-tuning. Make sure that you set a larger max_answer_length for SQuAD, because its answers tend to be longer. |
I'm not sure which model you used for generating the phrase vectors, but the score seems slightly low compared to the model uploaded on GitHub (densephrases-multi scores 29 EM before QSFT on SQuAD). |
Thanks! That's really helpful.
Is there a default max_answer_length for SQuAD that you've been using?
The model I used is
Generating Vecs (in parallel):
Building Index:
Compressing meta:
Evaluating index:
|
Looks like you did a great job for the entire process (except that I don't know why you chose First, I have to mention that the current hyperparameters ( Second, for SQuAD, the maximum sequence length matters more than the batch size in my experience. Also, max_answer_length is currently set to 10, which is meant for NQ (whose answers are at most 5 words long), but you can set it to 20 for SQuAD. |
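To illustrate why max_answer_length matters, here is a minimal sketch of length-constrained span selection. It is not the DensePhrases implementation: `best_span`, the score lists, and the assumption that the limit counts tokens are all hypothetical and only meant to show how a small limit can rule out a longer, higher-scoring answer.

```python
from typing import List, Optional, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_length: int,
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) token indices of the highest-scoring span
    whose length (end - start + 1) does not exceed max_answer_length.

    Hypothetical helper for illustration only.
    """
    best = None
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider end positions within the allowed span length.
        last = min(len(end_scores), i + max_answer_length)
        for j in range(i, last):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Toy scores (made up): the strongest end token sits 5 tokens after
# the strongest start token.
start_scores = [0.1, 5.0, 0.2, 0.1, 0.0, -1.0]
end_scores = [0.0, 0.1, 2.0, 0.2, 0.1, 6.0]

short = best_span(start_scores, end_scores, max_answer_length=2)
long = best_span(start_scores, end_scores, max_answer_length=5)
print(short, long)
```

With the small limit, the search has to settle for a shorter span even though the longer one scores higher, which is the kind of loss a dataset with longer answers would suffer.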
I see. But looking at the training log, DensePhrases seems to be working fine on the dev set:
|
Oh, I think you are missing this part where you can evaluate your model based on the semi-open domain setup (using all development set passages). This is a better approximation of open-domain QA. Lines 219 to 230 in b52fe06
|
Got it. I will try the semi-open-domain evaluation and see if it works. |
I've built the compressed DensePhrases index on SQuAD using OPQ96. I haven't run any query-side fine-tuning yet, but here are the results:
11/22/2021 19:50:57 - INFO - main - no_ans/all: 0, 10570
11/22/2021 19:50:57 - INFO - main - Evaluating 10570 answers
11/22/2021 19:50:58 - INFO - main - EM: 21.63, F1: 27.96
11/22/2021 19:50:58 - INFO - main - 1) Which NFL team represented the AFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], top 5 prediction: ['Denver Broncos', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers', 'Pittsburgh Steelers']
11/22/2021 19:50:58 - INFO - main - 2) Which NFL team represented the NFC at Super Bowl 50
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Carolina Panthers', 'Carolina Panthers', 'Carolina Panthers'], top 5 prediction: ['San Francisco 49ers', 'Chicago Bears', 'Seattle Seahawks', 'Tampa Bay Buccaneers', 'Green Bay Packers']
11/22/2021 19:50:58 - INFO - main - 3) Where did Super Bowl 50 take place
11/22/2021 19:50:58 - INFO - main - => groundtruths: ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], top 5 prediction: ['Tacoma, Washington, USA', "Levi's Stadium in Santa Clara, California", 'DeVault Vineyards in Concord, Virginia', "Levi's Stadium in Santa Clara", 'Jinan Olympic Sports Center Gymnasium in Jinan, China']
11/22/2021 19:53:44 - INFO - main - {'exact_match_top1': 21.62724692526017, 'f1_score_top1': 27.958255585698414}
11/22/2021 19:53:44 - INFO - main - {'exact_match_top200': 57.48344370860927, 'f1_score_top200': 73.28679644685603}
11/22/2021 19:53:44 - INFO - main - {'redundancy of top200': 5.308987701040681}
11/22/2021 19:53:44 - INFO - main - Saving prediction file to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200.pred
10570it [00:23, 448.84it/s]
11/22/2021 19:54:58 - INFO - main - avg psg len=124.84 for 10570 preds
11/22/2021 19:54:58 - INFO - main - dump to .//outputs/densephrases-squad-ddp/pred/test_preprocessed_10570_top200_psg-top100.json
ctx token length: 124.84
unique titles: 98.20
Top-1 = 27.02%
Top-5 = 42.80%
Top-20 = 56.40%
Top-100 = 69.20%
Acc@1 when Acc@100 = 39.05%
MRR@20 = 34.30
P@20 = 8.94
I understand that index compression results in accuracy loss w/o query-side finetuning. However, the score still looks a little bit too low to me. Could @jhyuklee confirm whether this looks alright?
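For anyone reading the log above, here is a minimal sketch of how SQuAD-style top-k exact match is commonly computed. It follows the usual SQuAD answer normalization (lowercase, strip punctuation, drop articles, collapse whitespace), but it is an assumption that the evaluation script here does exactly the same; the function names are hypothetical. The data is the three example questions printed in the log.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, remove punctuation,
    remove English articles, and collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_top_k(predictions: List[str], groundtruths: List[str], k: int) -> bool:
    """True if any of the first k predictions matches any ground truth."""
    gold = {normalize_answer(g) for g in groundtruths}
    return any(normalize_answer(p) in gold for p in predictions[:k])


def em_score(all_preds: List[List[str]], all_golds: List[List[str]], k: int) -> float:
    """Percentage of questions with an exact match in the top k."""
    hits = sum(exact_match_top_k(p, g, k) for p, g in zip(all_preds, all_golds))
    return 100.0 * hits / len(all_preds)


# The three examples shown in the log above.
golds = [
    ["Denver Broncos", "Denver Broncos", "Denver Broncos"],
    ["Carolina Panthers", "Carolina Panthers", "Carolina Panthers"],
    ["Santa Clara, California", "Levi's Stadium",
     "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."],
]
preds = [
    ["Denver Broncos", "Pittsburgh Steelers", "Pittsburgh Steelers",
     "Pittsburgh Steelers", "Pittsburgh Steelers"],
    ["San Francisco 49ers", "Chicago Bears", "Seattle Seahawks",
     "Tampa Bay Buccaneers", "Green Bay Packers"],
    ["Tacoma, Washington, USA", "Levi's Stadium in Santa Clara, California",
     "DeVault Vineyards in Concord, Virginia", "Levi's Stadium in Santa Clara",
     "Jinan Olympic Sports Center Gymnasium in Jinan, China"],
]

top1 = em_score(preds, golds, k=1)
top5 = em_score(preds, golds, k=5)
print(f"EM@1 = {top1:.2f}, EM@5 = {top5:.2f}")
```

Under this metric, the third question gets no credit even at top 5, because "Levi's Stadium in Santa Clara" is not an exact match for any ground truth. That is one reason F1 sits noticeably above EM in the log.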