textually-duplicate passages in msmarco v2 #8

Open
seanmacavaney opened this issue Jul 16, 2021 · 1 comment

Comments

@seanmacavaney

I noticed that there's a sizeable number of passages in the v2 corpus that have text that exactly matches other passages: ~27.8 million passages, which amounts to around 20% of all passages in the corpus. Sometimes it's extremely prevalent, with one passage even being repeated 23,680 times [1]. [code] [file containing the duplicate passage IDs]

This is realistic, of course, since multiple documents often do contain the same passage. This is reflected in the other passage fields. I am wondering how this will affect evaluation, though. If I recall correctly, in the past NIST assessors evaluated the passage retrieval task irrespective of the context from the document. Is that the case again this year, or will the associated document also be considered? If only the passage text is considered, how will duplicates be handled?

[1] FWIW cases like this particular one (msmarco_passage_27_152452064, an advertising disclosure from Yellow Pages) are rather unlikely to be an answer to an actual question. Other exact duplicates are high-quality answers, though.
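For readers who want to reproduce a count like this, here is a minimal sketch of exact-duplicate detection. It is not the script linked above; it assumes the passages have already been read into `(pid, text)` pairs, and the function names are made up for illustration. Since the thread does not say whether the ~27.8 million figure counts every member of a duplicate group or only the redundant copies, the sketch reports both.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(passages):
    """Group passage IDs whose text matches exactly.

    `passages` is any iterable of (pid, text) pairs (an assumed input shape).
    Returns a dict mapping a text hash to the list of pids sharing that text,
    keeping only groups that contain more than one pid.
    """
    groups = defaultdict(list)
    for pid, text in passages:
        # Hash the text so the full string is not kept as a dict key.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[digest].append(pid)
    return {h: pids for h, pids in groups.items() if len(pids) > 1}


def duplicate_stats(duplicate_groups):
    """Summarise the groups returned by find_exact_duplicates."""
    sizes = [len(pids) for pids in duplicate_groups.values()]
    return {
        # Every passage whose text also appears under another pid.
        "passages_in_groups": sum(sizes),
        # Copies beyond the first occurrence in each group.
        "redundant_copies": sum(s - 1 for s in sizes),
        # Size of the most-repeated passage (0 if there are no duplicates).
        "largest_group": max(sizes, default=0),
    }
```

Hashing only catches byte-identical text; passages that differ in whitespace or casing would need normalisation first.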

@craswell
Contributor

Yeah, we knew dupes would be an issue with the new datasets. It's a realistic dataset right now, but in such realistic situations we know there would be deduping mechanisms in the retrieval system, to prevent users from seeing duplicate stuff.

If people participate in the passage task by first ranking documents, then ranking the passages of the top-k documents, then we definitely don't want to remove any passages, just because they also appeared in some other doc. So if we wanted to do some unrealistic form of deduping, it might look more like having a single passage ID that points to many different documents, so no passage-document connections are lost.
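One way to picture the "single passage ID that points to many different documents" idea is sketched below. This is only an illustration, not something the collection ships: the `(pid, docid, text)` input shape, the choice of the first-seen pid as the canonical one, and the function name are all assumptions.

```python
from collections import defaultdict


def collapse_duplicates(passages):
    """Map textually identical passages onto one canonical pid.

    `passages` is an iterable of (pid, docid, text) triples (assumed shape).
    Returns:
      alias     -- dict mapping every original pid to its canonical pid
      doc_links -- dict mapping each canonical pid to all docids containing it
    """
    canonical_by_text = {}
    alias = {}
    doc_links = defaultdict(list)
    for pid, docid, text in passages:
        # The first pid seen for a given text becomes the canonical one.
        canon = canonical_by_text.setdefault(text, pid)
        alias[pid] = canon
        # Every passage-document connection is kept under the canonical pid.
        doc_links[canon].append(docid)
    return alias, dict(doc_links)
```

A system that ranks documents first can still reach the passage through any of its documents via `doc_links`, while a ranked list can be deduplicated by passing each retrieved pid through `alias`. At corpus scale the text key would likely be replaced by a hash.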

Having said that, as soon as we get rid of exact dupes then there are "near dupes" to consider.
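The thread does not define "near dupes", so as one common working definition, the sketch below scores two passages by the Jaccard overlap of their word shingles. The shingle size and any cut-off for calling two passages near duplicates are assumptions to tune, not values from this discussion.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Short texts yield a single shingle holding all of their tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Jaccard similarity between the shingle sets of two passages."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Comparing all pairs this way is quadratic in the number of passages, so at the size of this corpus it would need an approximate method such as MinHash with locality-sensitive hashing.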

Overall, our approach so far was to do some testing to make sure the collection is usable with the current training+dev sets, which seems to be the case, and we'll have to figure out some further steps later. Thanks for raising the issue.


2 participants