Iterative retrieval in case of non-unique top-k retrieval #25

Closed
dhdhagar opened this issue Apr 11, 2022 · 2 comments

Comments

@dhdhagar
Contributor

Hi! Thanks for this amazing work, and for making your code open-source.

I'm trying to figure out where in the code non-unique passage retrieval is handled so that the final k results are unique. According to this footnote on page 3 of your paper "Phrase Retrieval Learns Passage Retrieval, Too", it seems that you perform iterative retrieval to achieve this. Could you point me to the code where this happens?

[image: the footnote on page 3 of the paper]

@jhyuklee
Member

Hi @dhdhagar! Thanks for your issue.

In this makefile line:

--top_k 200 \

you can see that we first retrieve the top 200 phrases. The aggregation into unique passages happens here:

https://github.com/princeton-nlp/DensePhrases/blob/main/densephrases/index.py#L430

For the datasets we currently provide, this is enough to output 100 passages per query in nearly all cases. In a very small number of edge cases, fewer than 100 unique passages are retrieved even with this setting; for those, you can raise --top_k to 400 in the makefile. There is currently no automatic procedure for this.
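
To illustrate the idea of collapsing the top phrases into unique passages, here is a minimal sketch. It is not the code in index.py: the function name `aggregate_unique_passages`, the `(passage_id, score)` tuple layout, and the choice to keep the best phrase score per passage are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def aggregate_unique_passages(
    phrases: List[Tuple[str, float]],
    psg_top_k: int,
) -> List[Tuple[str, float]]:
    """Collapse (passage_id, score) phrase hits into unique passages.

    Hypothetical helper: each passage keeps the score of its
    highest-scoring phrase, and at most psg_top_k passages are returned.
    """
    # Visit phrases from highest to lowest score, so the first time a
    # passage appears it carries that passage's best score.
    ranked = sorted(phrases, key=lambda hit: hit[1], reverse=True)

    best: Dict[str, float] = {}  # dicts preserve insertion order
    for passage_id, score in ranked:
        if passage_id not in best:
            best[passage_id] = score
            if len(best) == psg_top_k:
                break
    return list(best.items())


hits = [("p1", 0.9), ("p2", 0.8), ("p1", 0.7), ("p3", 0.6)]
print(aggregate_unique_passages(hits, psg_top_k=2))
```

If the phrase hits cover fewer than `psg_top_k` distinct passages, the result is simply shorter, which is the edge case described above.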

@dhdhagar
Contributor Author

That makes it clear, thank you! So, --top_k is used to fetch a larger number of phrases, and the length of the final results depends on --psg_top_k.
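
Since the repository has no automatic procedure for enlarging the phrase top-k, one way to wrap it yourself is sketched below. Everything here is an assumption for illustration: `search_phrases` is a hypothetical callable standing in for whatever returns the top phrases for a query, and the doubling schedule and `max_top_k` cap are arbitrary choices, not DensePhrases behaviour.

```python
from typing import Callable, List, Tuple

Phrase = Tuple[str, float]  # (passage_id, phrase_score), an assumed layout


def retrieve_passages(
    query: str,
    search_phrases: Callable[[str, int], List[Phrase]],
    psg_top_k: int = 100,
    top_k: int = 200,
    max_top_k: int = 1600,
) -> List[str]:
    """Retry phrase retrieval with a larger top_k until psg_top_k unique
    passages are found, the cap is reached, or the search is exhausted."""
    while True:
        phrases = search_phrases(query, top_k)
        # dict.fromkeys removes duplicates while keeping the ranking order.
        unique = list(dict.fromkeys(pid for pid, _ in phrases))

        enough = len(unique) >= psg_top_k
        exhausted = len(phrases) < top_k  # nothing more to retrieve
        if enough or exhausted or top_k >= max_top_k:
            return unique[:psg_top_k]
        top_k = min(top_k * 2, max_top_k)
```

With the defaults, this starts at 200 phrases and moves to 400 only for queries that fall short of 100 unique passages, mirroring the manual workaround suggested above.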
