First published year doesn't account for outliers #5189
Labels
Affects: Data
Issues that affect book/author metadata or user/account data. [managed]
Lead: @cdrini
Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed]
Priority: 3
Issues that we can consider at our leisure. [managed]
Theme: Bots
Issues relating to Bots & data cleanup
Type: Bug
Something isn't working. [managed]
The first published year shown in search results (and probably used elsewhere) is the lowest year across all of a work's editions, as expected.
However, it doesn't account for outliers: if a work has hundreds of editions with plausible years but one edition with the year 0001, it will always show 0001.
Evidence / Screenshot (if possible)
Relevant url?
https://openlibrary.org/search?mode=everything&q=The+adventures+of+Tom+Sawyer&sort=old
Steps to Reproduce
Proposal & Constraints
Obviously, we should fix the data. But there is always a possibility of more bad data getting in, so it would be nice to handle bad data more gracefully. The impact on search is pretty annoying: sorting search results by first published year gives unhelpful results.
We should rely on our librarian friends to help us think through what heuristics may be useful.
We could probably limit the heuristic to books with more than 50 editions or so. That way we don't have to worry about works with only a few editions. Perhaps some statistical clustering could be used.
A crude heuristic could be: if the average year published is > 1000 and the lowest year published is < 100, ignore the lowest year. That would basically weed out the cases where a low number is accidentally set.
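As a rough illustration of the statistical idea, here is a minimal sketch that uses a median-based outlier filter rather than true clustering (a simpler stand-in, not a recommendation). The function name, the 50-edition cutoff, and the tolerance of 10 deviations are all assumptions for discussion; nothing here reflects the existing code in update_work.py.

```python
from statistics import median

MIN_EDITIONS = 50      # assumed cutoff, from the "more than 50 editions or so" idea
MAX_DEVIATIONS = 10.0  # assumed tolerance, would need tuning with librarians


def robust_first_year(years, min_editions=MIN_EDITIONS, max_deviations=MAX_DEVIATIONS):
    """Return the earliest year that is not a low outlier (hypothetical helper).

    Works with fewer than `min_editions` years are left alone and simply
    return their minimum year.
    """
    if not years:
        return None
    if len(years) < min_editions:
        return min(years)

    mid = median(years)
    # Median absolute deviation: a spread measure that one bad value barely moves.
    mad = median(abs(y - mid) for y in years)
    spread = max(mad, 1)  # avoid a zero spread when most years are identical

    # Keep years that are not too far *below* the median.
    # Years at or above the median always pass, so this list is never empty.
    plausible = [y for y in years if mid - y <= max_deviations * spread]
    return min(plausible)
```

One known weakness: a genuinely old first edition followed by many modern reprints could also be discarded, so the tolerance would need input from librarians before anything like this is used.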
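The crude heuristic could look roughly like the sketch below. It assumes the edition years are already available as a list of integers, and it interprets "ignore it" as dropping every year below 100 when the rule fires; the function name is hypothetical and is not the one used in update_work.py.

```python
def first_publish_year(years):
    """Lowest edition year, ignoring implausibly low years (hypothetical helper).

    Rule from the proposal: if the average year is > 1000 and the lowest
    year is < 100, treat the low values as data-entry errors.
    """
    if not years:
        return None

    lowest = min(years)
    average = sum(years) / len(years)

    if average > 1000 and lowest < 100:
        # An average above 1000 guarantees at least one year above 1000,
        # so this filtered list is never empty.
        kept = [y for y in years if y >= 100]
        return min(kept)

    return lowest
```

For example, `first_publish_year([1, 1876, 1884, 1920])` would skip the year 1 and return 1876, while a work whose editions are all from the first century is left untouched because its average is below 1000.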
My main concerns about any approach:
Is this even an issue worth addressing in this way? Someone familiar with the database could probably run a quick query to see how many books have editions both with a year < 100 and > 1000. From that we could get an idea of how common this problem is.
If these could be user errors then we could warn people when they enter a date that looks like an outlier.
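Both concerns could be explored with small checks like the ones sketched here. This assumes the edition years have already been exported into a plain dict keyed by work; the dict shape, the keys, and both function names are made up for illustration and do not correspond to any real Open Library query or API.

```python
def count_suspect_works(years_by_work):
    """Count works that have editions both with a year < 100 and a year > 1000."""
    return sum(
        1
        for years in years_by_work.values()
        if any(y < 100 for y in years) and any(y > 1000 for y in years)
    )


def should_warn(new_year, existing_years):
    """True if a newly entered year looks like an outlier next to existing editions."""
    return new_year < 100 and any(y > 1000 for y in existing_years)


# Hypothetical sample data, not real records.
sample = {
    "work-a": [1, 1876, 1920],   # suspect: has both a year < 100 and a year > 1000
    "work-b": [1876, 1884],      # fine
    "work-c": [50, 60],          # fine: plausibly an ancient text
}
print(count_suspect_works(sample))
print(should_warn(1, [1876, 1920]))
```

The count gives a sense of how common the problem is, and the same threshold test could back a warning in the edit form.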
I welcome thoughts from the community 😃
Related files
It is used here:
openlibrary/openlibrary/macros/SearchResultsWork.html
Line 48 in 2bb16bd
It seems to be calculated here:
openlibrary/openlibrary/solr/update_work.py
Line 548 in 2bb16bd
Stakeholders