ASIN-only imports from Amazon include inappropriate and duplicate sellers' items #2674

Open
seabelis opened this issue Nov 29, 2019 · 10 comments
Labels
  • Affects: Data (Issues that affect book/author metadata or user/account data.) [managed]
  • Lead: @scottbarnes (Issues overseen by Scott, Community Imports)
  • Module: Import (Issues related to the configuration or use of importbot and other bulk import systems.) [managed]
  • Needs: Investigation (This issue/PR needs a root-cause analysis to determine a solution.) [managed]
  • Priority: 3 (Issues that we can consider at our leisure.) [managed]
  • Theme: Affiliate API
  • Type: Bug (Something isn't working.) [managed]

Comments

@seabelis
Collaborator

seabelis commented Nov 29, 2019

As I understand it, Amazon assigns an ASIN to each item being sold; this does not have to be a single book nor does it have to be sets of books that were originally published as a set. I have recently noticed imports of seller-created bundles that are not appropriate for the OpenLibrary catalog. Such items should be excluded from imports if possible.

Relevant url?

List: https://openlibrary.org/people/seabelis/lists/OL143669L/Bad_ASIN_imports

This looks like it is legitimately a set, but does it need to be represented as such in the catalog? I think records for the individual volumes are sufficient. In any case, it was imported from multiple sellers. https://openlibrary.org/works/OL20110771W

Details

  • Logged in (Y/N)?
  • Browser type/version?
  • Operating system?
  • Environment (prod/dev/local)? prod

Proposal & Constraints

Related files

Stakeholders

@hornc

@seabelis seabelis added the Type: Bug Something isn't working. [managed] label Nov 29, 2019
@seabelis seabelis changed the title from "ASIN-only imports from Amazon include inappropriate sellers' items" to "ASIN-only imports from Amazon include inappropriate and duplicate sellers' items" Nov 29, 2019
@xayhewalo xayhewalo added Affects: Data Issues that affect book/author metadata or user/account data. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] State: Backlogged labels Nov 29, 2019
@tfmorris
Contributor

tfmorris commented Dec 1, 2019

Sounds very similar to #709

@hornc hornc self-assigned this Dec 2, 2019
@hornc
Collaborator

hornc commented Dec 2, 2019

Thanks for reporting this one @seabelis !

@hornc
Collaborator

hornc commented Dec 2, 2019

What I'm finding with a lot of these ASINs (and others in the data) is that they 404; these seller ASINs do not seem to be persistent IDs at all.

This one is found:
https://openlibrary.org/books/OL27290823M/3_Titles_By_John_Steinbeck_Sweet_Thursday_East_of_Eden_The_Grapes_of_Wrath.
https://www.amazon.com/gp/product/B001OTLKD2?tag=internetarchi-20
https://openlibrary.org/prices?asin=B001OTLKD2

I was hoping to find a categorization that indicates that this is not a single book, but it does seem to be lumped in with books.

@seabelis
Collaborator Author

seabelis commented Dec 2, 2019

Why are ASIN-only items being imported? What is the expectation? It seems there is very little good data to be had. I may be incorrect about this, but it seems like most of the data will be third-party seller items that don't have any ISBN or a suitable corresponding ISBN in the Amazon catalog; how are duplicates checked on these? Usually the third-party items that fall into this category have limited or inaccurate information. As I have seen on GR, these items get imported over and over if the seller re-lists.

ASINs also correspond to Kindle and Audible items, but those DO have corresponding ISBNs and should not create new items in the catalog unless the import checks for, or includes, the ISBN in the new record.
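
To make the duplicate question concrete, here is a minimal sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the `Catalog` class, the record fields (`asin`, `isbns`), and the edition keys are hypothetical and are not the actual import code. The idea is simply to look up ISBNs before ASINs, so that a Kindle or Audible record carrying an ISBN resolves to the edition already in the catalog instead of creating a new one.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy in-memory catalog indexed by ISBN and by ASIN (hypothetical)."""

    by_isbn: dict = field(default_factory=dict)
    by_asin: dict = field(default_factory=dict)

    def find_existing(self, record: dict):
        """Return the key of an edition matching the record, or None.

        ISBNs are checked first, then the ASIN.
        """
        for isbn in record.get("isbns", []):
            if isbn in self.by_isbn:
                return self.by_isbn[isbn]
        asin = record.get("asin")
        if asin and asin in self.by_asin:
            return self.by_asin[asin]
        return None

    def import_record(self, record: dict, new_key: str) -> str:
        """Import a record unless it duplicates an edition; return the edition key."""
        existing = self.find_existing(record)
        if existing is not None:
            # Remember the ASIN so a later ASIN-only import hits the same edition.
            if record.get("asin"):
                self.by_asin.setdefault(record["asin"], existing)
            return existing
        for isbn in record.get("isbns", []):
            self.by_isbn[isbn] = new_key
        if record.get("asin"):
            self.by_asin[record["asin"]] = new_key
        return new_key


# Placeholder identifiers, not real ISBNs or ASINs.
catalog = Catalog()
first = catalog.import_record({"isbns": ["9780000000002"]}, "edition-1")
kindle = catalog.import_record(
    {"asin": "B000000001", "isbns": ["9780000000002"]}, "edition-2"
)
relisted = catalog.import_record({"asin": "B000000001"}, "edition-3")
```

Note that this only catches re-imports of the same ASIN or ISBN. It does nothing for a seller who re-lists the same item under a new ASIN with no ISBN, which is the case I am most worried about.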

@LeadSongDog

@hornc
The 404s may just have been a local or transient issue. The Amazon URL loads for me today. As you may know, it is not a single book, but three distinct books with a common author (John Steinbeck), publisher (Penguin Classics), publication date (1986), and format (paperback). In this case it was someone bundling them for sale, presumably just to make shipping worthwhile to the buyer.
I still contend that ASIN-only entries should only be imported for Kindle books. The crap:information ratio is just too high otherwise. This is just one example.

@tfmorris
Contributor

tfmorris commented Dec 3, 2019

I noticed that most of them were 404ing as well. How did these ASINs even make it into our import queue? The Steinbeck books item above is a) below #20,000,000 on the top sellers list, so I can't imagine anyone is linking to it and b) has a title that doesn't match any of his works.

We should reject items like this if they make it into the queue, but they shouldn't even be in the import queue in the first place.

@hornc hornc added this to To do in lead board test Dec 18, 2019
@mekarpeles mekarpeles added the Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] label Dec 18, 2019
@xayhewalo xayhewalo added Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Dec 24, 2019
@hornc
Collaborator

hornc commented Jan 12, 2020

@seabelis, there is an intent to import pre-ISBN physical books with an ASIN, e.g. https://www.amazon.com/Greek-studies-Gilbert-Murray/dp/B0007JAFEA

I have not been able to determine a way to tell the difference between an ebook and a pre-ISBN book (I don't think there is a way).

I think we do want to import:

  • Pre-ISBN books with ASINs
  • Modern E-books with ASINs

but we do not want to import seller bundled items like the examples in this issue report.
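
Since I could not find a category that separates bundles from single books, the only signal I can see is the title itself. The following is a rough sketch of a title heuristic, not something that exists in the import pipeline: the patterns and the function name `looks_like_seller_bundle` are assumptions, and the example title is reconstructed from the Open Library URL slug quoted earlier in this thread.

```python
import re

# Hypothetical patterns for seller-created bundles; these would need tuning
# against real import data before being trusted.
BUNDLE_PATTERNS = [
    # "3 Titles By John Steinbeck ..."
    re.compile(r"^\s*\d+\s+(titles|books|volumes|novels)\s+by\b", re.IGNORECASE),
    # "Lot of 5 ...", "Set of 3 ..."
    re.compile(r"\b(set|lot|bundle|collection) of \d+\b", re.IGNORECASE),
    # "4-Book Set", "3 volume lot"
    re.compile(r"\b\d+[- ](book|volume|title)s?\s+(set|lot|bundle)\b", re.IGNORECASE),
]


def looks_like_seller_bundle(title: str) -> bool:
    """Return True when the title matches any of the bundle patterns."""
    return any(pattern.search(title) for pattern in BUNDLE_PATTERNS)


steinbeck = (
    "3 Titles By John Steinbeck Sweet Thursday East of Eden The Grapes of Wrath"
)
flagged = looks_like_seller_bundle(steinbeck)
single = looks_like_seller_bundle("Greek studies")
```

A heuristic like this will also flag some legitimate publisher sets, so it is probably better suited to routing records for review than to rejecting them outright.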

@seabelis
Collaborator Author

seabelis commented Jan 13, 2020

@hornc I think the issue with ASINs is that they are just Amazon catalog numbers; they don't indicate that something is a book or a unique book. What is the expectation for the imported Kindle and Audible items; will they be imported with their ISBNs? Even Goodreads has separate items for Kindle (by ASIN) and their corresponding ebook records (by ISBN). I'm not sure if this is a marketing choice or because they cannot import the ISBN for Kindle items. I don't think it's useful for Open Library to mirror the Amazon catalog; Goodreads has a clear incentive to do so.

The provided example is a third-party seller item; these usually have low-quality or incomplete data; what is the benefit of importing them?

@LeadSongDog

A few seconds on Worldcat found that edition and numerous others including translations to Spanish and French.
https://www.worldcat.org/search?q=Greek+studies+1947+Gilbert+Murray

If an AMZ record has title, year, publisher, and author it should be straightforward to find the matching OCLC entry and get some more reliable catalogue data to work with, vastly improving the entry:
https://openlibrary.org/books/OL27890402M/Greek_studies?_compare=Compare&b=2&a=1&m=diff

otoh, when there is no match for these basics in Worldcat, the odds that AMZ has it correct dwindle into insignificance. Absent a match we should ignore the AMZ entry.
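
As a sketch of the gate I am proposing (illustrative only: the function names, the record fields, and the publisher value are assumptions, and the candidate records are supplied in memory rather than fetched from any real Worldcat API):

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace for loose comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def matches_basics(amazon: dict, candidate: dict) -> bool:
    """True when title, author, publisher, and year all agree."""
    for name in ("title", "author", "publisher"):
        left = normalize(amazon.get(name, ""))
        # A missing or empty field on the Amazon side never counts as a match.
        if not left or left != normalize(candidate.get(name, "")):
            return False
    year = amazon.get("year")
    return year is not None and year == candidate.get("year")


def choose_catalog_record(amazon: dict, candidates: list):
    """Return the first candidate that matches, or None to skip the import."""
    for candidate in candidates:
        if matches_basics(amazon, candidate):
            return candidate
    return None


# "Example Press" is a placeholder publisher, not the real one.
amazon_record = {
    "title": "Greek Studies",
    "author": "Gilbert Murray",
    "publisher": "Example Press",
    "year": 1947,
}
library_records = [
    {
        "title": "Greek studies.",
        "author": "Gilbert Murray",
        "publisher": "Example Press",
        "year": 1947,
    }
]
match = choose_catalog_record(amazon_record, library_records)
no_match = choose_catalog_record({**amazon_record, "year": 1946}, library_records)
```

Exact comparison after normalization is deliberately strict. Publisher names and reprint years vary a lot between sources, so a real matcher would likely need fuzzier rules, but the principle is the same: no match, no import.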

@tfmorris
Contributor

there is an intent to import pre-ISBN physical books with an ASIN, e.g. https://www.amazon.com/Greek-studies-Gilbert-Murray/dp/B0007JAFEA

We already have that edition: https://openlibrary.org/books/OL26546996M/Greek_studies

Actually we have a 1948 printing of the 1947 reprint edition, but it's cataloged as being published in 1946.

What are the odds that Amazon will have a good quality catalog record for an item that exists nowhere else?

@hornc hornc removed their assignment Mar 7, 2020
@scottbarnes scottbarnes mentioned this issue Mar 24, 2023
35 tasks
@hornc hornc removed the Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] label Sep 10, 2023
@mekarpeles mekarpeles added the Lead: @scottbarnes Issues overseen by Scott (Community Imports) label Jan 29, 2024

7 participants