First published year doesn't account for outliers #5189

Open
RayBB opened this issue May 17, 2021 · 2 comments
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Theme: Bots Issues relating to Bots & data cleanup Type: Bug Something isn't working. [managed]

Comments

@RayBB
Collaborator

RayBB commented May 17, 2021

The first published year shown on search results (and probably used elsewhere) shows the lowest year of all editions as expected.

However, it doesn't account for outliers: if a work has hundreds of editions that show a certain year but one edition with year 0001, it will always show 0001.

Evidence / Screenshot (if possible)

image

More image

Relevant url?

https://openlibrary.org/search?mode=everything&q=The+adventures+of+Tom+Sawyer&sort=old

Steps to Reproduce

  1. Go to the search results page and see the issue.
  • Actual: Shows an incorrect first published year
  • Expected: Shows a correct first published year.

Proposal & Constraints

Obviously, we should fix the data. But there is always a possibility of more bad data getting in so it would be nice if we did something to handle the bad data more gracefully. The impact on search is pretty annoying as searching by first published will give you unhelpful results.

We should rely on our librarian friends to help us think through what heuristics may be useful.

We could probably limit the heuristic to books with more than 50 editions or so. That way we don't have to worry about works with only a few editions. Perhaps some statistical clustering could be used.
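To make the statistical idea concrete, here is a minimal sketch that uses the median and the median absolute deviation (MAD) instead of full clustering. The function name `robust_first_year`, the `max_deviations=10.0` cutoff, and the MAD approach itself are my assumptions for illustration, not anything that exists in Open Library; only the 50-edition threshold comes from the paragraph above.

```python
from statistics import median


def robust_first_year(pub_years, min_editions=50, max_deviations=10.0):
    """Earliest year, ignoring years far below the median for large works.

    Hypothetical helper: works with fewer than `min_editions` editions are
    left alone and simply return the plain minimum.
    """
    years = sorted(int(y) for y in pub_years)
    if not years:
        return None
    if len(years) < min_editions:
        return years[0]

    mid = median(years)
    # Median absolute deviation; floor at 1 so identical years don't give 0.
    mad = max(median([abs(y - mid) for y in years]), 1)

    # Only early years are dropped; anything at or above the median is kept,
    # so `kept` is never empty.
    kept = [y for y in years if mid - y <= max_deviations * mad]
    return min(kept)
```

Note that this would also drop a genuine first edition that is centuries older than a large cluster of reprints, which is exactly the rediscovered-book concern, so the cutoff would need input from librarians.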

A crude heuristic could be: if the average year published is > 1000 and the lowest year published is < 100, ignore the lowest year. That would basically weed out the cases where a low number is accidentally set.
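A rough sketch of that crude heuristic, assuming `pub_years` is an iterable of year strings as in the existing `min(int(y) for y in pub_years)` call. The function name and the parameter names `low_cutoff` and `mean_cutoff` are made up here; the values 100 and 1000 are the ones suggested above.

```python
def first_year_ignoring_low_outliers(pub_years, low_cutoff=100, mean_cutoff=1000):
    """Earliest year, skipping years below `low_cutoff` when the average
    publication year is above `mean_cutoff`."""
    years = [int(y) for y in pub_years]
    if not years:
        return None

    mean_year = sum(years) / len(years)
    if mean_year > mean_cutoff:
        plausible = [y for y in years if y >= low_cutoff]
        if plausible:
            return min(plausible)

    # Either the work really is ancient, or nothing plausible is left.
    return min(years)
```

One weakness of using the average: if enough editions carry a bogus low year, the mean itself drops below 1000 and the check no longer triggers.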

My main concerns about any approach:

  • How do we handle a book that was published once and then discovered hundreds of years later and republished?
  • It could hide the underlying data issues and prevent them from being addressed quickly. To resolve this, a page could be created that shows the editions that are marked as outliers.

Is this even an issue worth addressing in this way? Someone familiar with the database could probably run a quick query to see how many books have editions both with a year < 100 and > 1000. From that we could get an idea of how common this problem is.
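I don't know the database schema, so this is not the actual query. As a sketch of the check, assume the edition years have already been loaded into a dict mapping a work key to its list of year strings (the dict shape and the name `works_with_suspect_years` are hypothetical):

```python
def works_with_suspect_years(editions_by_work, low=100, high=1000):
    """Return keys of works that have at least one edition year below `low`
    and at least one above `high`."""
    suspects = []
    for work_key, pub_years in editions_by_work.items():
        years = [int(y) for y in pub_years]
        has_low = any(y < low for y in years)
        has_high = any(y > high for y in years)
        if has_low and has_high:
            suspects.append(work_key)
    return suspects
```

`len(works_with_suspect_years(data))` would then give a first estimate of how common the problem is.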

If these could be user errors then we could warn people when they enter a date that looks like an outlier.
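A possible shape for such a warning, as a sketch only. The function `outlier_warning` and the 500-year `max_gap` are assumptions picked for illustration; the edit form would need its own hook for calling something like this.

```python
def outlier_warning(new_year, existing_years, max_gap=500):
    """Return a warning message if `new_year` is more than `max_gap` years
    earlier than every existing edition of the work, else None."""
    years = [int(y) for y in existing_years]
    if not years:
        return None  # nothing to compare against

    earliest = min(years)
    gap = earliest - int(new_year)
    if gap > max_gap:
        return (
            f"Year {int(new_year)} is {gap} years earlier than the earliest "
            f"existing edition ({earliest}). Please double-check the date."
        )
    return None
```

This only warns and never blocks the save, so a legitimately old edition can still be entered.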

I welcome thoughts from the community 😃

Related files

It is used here:

$_('First published in %(year)s', year=doc.first_publish_year)

It seems to be calculated here:

add('first_publish_year', min(int(y) for y in pub_years))
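To show why that expression is sensitive to a single bad edition, here is the same `min(...)` applied to some made-up years (the sample list is invented, not taken from the Tom Sawyer record):

```python
# Hypothetical edition years for one work; "0001" is the bad record.
pub_years = ["1876", "1884", "0001", "1920"]

# Same expression as in the indexing code: one stray value wins.
first_publish_year = min(int(y) for y in pub_years)
print(first_publish_year)
```

Since `int("0001")` is 1, the minimum is 1 no matter how many correct editions exist.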

Stakeholders

@RayBB RayBB added Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Bug Something isn't working. [managed] labels May 17, 2021
@tfmorris
Contributor

Any algorithmic bandaid developed for the app could be just as easily applied to a data cleaning job to fix the actual data and benefit users of both the API and data dumps.

Data quality can't be ignored forever -- or at least it shouldn't be.

@mekarpeles mekarpeles added Affects: Data Issues that affect book/author metadata or user/account data. [managed] Theme: Bots Issues relating to Bots & data cleanup Priority: 3 Issues that we can consider at our leisure. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels May 18, 2021
@LeadSongDog

LeadSongDog commented May 19, 2021

Some obvious edition date tests:

  1. Editions can’t be older than their authors (or publishers)
  2. Editions with valid ISBNs can’t be older than 1965 (9 digit SBNs from the first few years later gained a prefix zero) except for facsimile reprints
  3. Century years are usually bogus, e.g. from converting 18XX to 1800
  4. Jan 1st is usually bogus (publishers rarely work holidays) and rarely useful

No such edition date should be used to establish the work’s first-published date.
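The four tests above could be sketched roughly like this. The `Edition` dataclass and its fields are hypothetical stand-ins, not the real Open Library edition model, and the publisher check from test 1 is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Edition:
    # Hypothetical, simplified record for illustration only.
    publish_date: date
    author_birth_year: Optional[int] = None
    has_valid_isbn: bool = False
    is_facsimile: bool = False


def suspect_reasons(edition):
    """Return the list of date tests this edition fails (empty if none)."""
    reasons = []
    year = edition.publish_date.year
    if edition.author_birth_year is not None and year < edition.author_birth_year:
        reasons.append("older than author")
    if edition.has_valid_isbn and not edition.is_facsimile and year < 1965:
        reasons.append("ISBN before 1965")
    if year % 100 == 0:
        reasons.append("century year")
    if (edition.publish_date.month, edition.publish_date.day) == (1, 1):
        reasons.append("January 1st")
    return reasons


def trusted_first_year(editions):
    """Earliest year among editions that pass every test, or None."""
    years = [e.publish_date.year for e in editions if not suspect_reasons(e)]
    return min(years) if years else None
```

Tests 3 and 4 are only "usually bogus", so a real implementation would probably treat them as weaker signals than tests 1 and 2 rather than hard exclusions.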

@mekarpeles mekarpeles added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Needs: Lead labels Jun 15, 2021