Re-enable Amazon imports from /isbn #8690
Conversation
I know this is still a draft, but I just wanted to comment on this:
I agree that any metadata from Amazon needs to be cleaned / filtered, but it seems like a tall task. For example, all these "author" records were created from Amazon for a single book: OL9952815A Gulsen Heper, Steven D. Levitt, Stephen J. Dubner, Iclal Buyukdevrim Ozcelik. OpenLibrary already has the book and the authors, so I don't see what value there is in attempting to untangle the Amazon mess (and I'm not even sure it's possible). Note that some of the names in the author string are translators, and that it can also include other superfluous text like "editor" or "trans." Creating these junk records just creates more work for the librarians.
That said, this will do nothing to address your point that the matching algorithm here seems to be failing miserably once the Amazon (or BWB, or wherever) data is put into the proper format for import. :( Here, for example, I see that the data that comes back from Amazon for this book (in the unit test) lists "Mary GrandePré" as an author of Harry Potter, rather than the illustrator (for the US editions).

I wouldn't call it a bright side to this PR, but at the very least it does stop automatic imports from Amazon.
This PR reenables AMZ imports from `/isbn`. The logic upon visiting `/isbn/[some isbn]` is now:
- attempt to fetch the book from the OL database;
- attempt to fetch the book from the `import_item` table (likely ISBNdb);
- attempt to fetch the metadata from the Amazon Products API, clean that metadata for import, add it as a `staged` import in the `import_item` table, and then immediately import it with `load()`, by way of `ImportItem.import_first_staged()`.

If any of these find or create an edition, the edition is returned.
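The lookup cascade above can be sketched roughly as follows. This is a toy model: the dicts stand in for the OL database, the `import_item` table, and the Amazon Products API, and `edition_for_isbn` is a hypothetical name, not OpenLibrary's actual function.

```python
# Toy sketch of the /isbn lookup cascade; the stores below are
# stand-ins for the OL database, the import_item table, and the
# Amazon Products API.
OL_DB = {"111": {"title": "Already in OL"}}
IMPORT_ITEM = {"222": {"title": "Staged from ISBNdb", "status": "staged"}}
AMAZON = {"333": {"title": "Only on Amazon"}}

def edition_for_isbn(isbn):
    # 1. Try the OL database first.
    if isbn in OL_DB:
        return OL_DB[isbn]
    # 2. Fall back to a staged row in the import_item table.
    if isbn in IMPORT_ITEM:
        OL_DB[isbn] = {"title": IMPORT_ITEM[isbn]["title"]}  # "import" it
        return OL_DB[isbn]
    # 3. Finally ask Amazon, stage the cleaned metadata, and
    #    import it immediately.
    if isbn in AMAZON:
        IMPORT_ITEM[isbn] = {**AMAZON[isbn], "status": "staged"}
        OL_DB[isbn] = {"title": AMAZON[isbn]["title"]}
        return OL_DB[isbn]
    return None
```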
Import AMZ records as `staged`. See internetarchive#8541
`/isbn`. E.g., `http://localhost:31337/isbn/059035342X?priority=0`. A priority of `0` will put the ISBN straight to the front of the queue for an AMZ Products API lookup, and attempt for three seconds to fetch the cached AMZ data (if it becomes available), returning that data, marshalled into a form suitable for creating an Edition, if possible.
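The three-second attempt to fetch cached AMZ data could be sketched as a simple polling loop; `wait_for_cached_amazon_data` and its `cache` argument are illustrative stand-ins, not the affiliate server's real API.

```python
import time

def wait_for_cached_amazon_data(isbn, cache, timeout=3.0, interval=1.0):
    """Poll a cache for up to `timeout` seconds, returning the AMZ
    record if it appears, else None. `cache` is a stand-in for
    wherever the affiliate server's lookup results land."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if isbn in cache:
            return cache[isbn]
        time.sleep(interval)
    return None
```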
then queue for import and immediately import the item, returning the resulting `Edition`, or `None`.
This commit:
- adds an `import_missing` parameter to `/api/books` to make the API try to import books from ISBN.
- relies on changes to `Edition.from_isbn()` which attempt to import editions first from `staged` items in the `import_item` table, and second from Amazon via the affiliate server.

This commit likely belongs in a separate PR, but for the sake of convenient testing it is for the moment included in the `/isbn` re-enabling PR.
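A minimal sketch of the branching this adds to `/api/books`, under assumed names (`api_books`, the in-memory `db`); in the real code, unknown ISBNs are delegated to `Edition.from_isbn()` rather than the placeholder used here.

```python
def api_books(bibkeys, import_missing=False):
    """Hypothetical sketch of /api/books: known books come straight
    from the database; unknown ISBNs are only imported when
    import_missing is set."""
    db = {"ISBN:111": {"title": "Known"}}  # stand-in for the OL database
    results = {}
    for key in bibkeys:
        if key in db:
            results[key] = db[key]
        elif import_missing and key.startswith("ISBN:"):
            # Placeholder for Edition.from_isbn(), which tries staged
            # import_item rows first, then the Amazon affiliate server.
            results[key] = {"title": f"Imported {key}"}
    return results
```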
`queue.PriorityQueue` gives priority to whatever would be returned by `min([queue_items])`. However, setting `Priority.priority=0` can look a bit like priority is disabled. Using an `Enum` may help clarify how priority works, as `Priority.HIGH` has a value of `0`, and is used when sorting priority items in the queue.
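A minimal illustration of the point: `queue.PriorityQueue` pops whatever compares smallest, so giving `Priority.HIGH` the value `0` makes high-priority items sort first. The `PrioritizedISBN` wrapper below is illustrative; the affiliate server's actual class names may differ.

```python
import queue
from dataclasses import dataclass, field
from enum import IntEnum

class Priority(IntEnum):
    HIGH = 0  # min() wins in a PriorityQueue, so 0 sorts first
    LOW = 1

@dataclass(order=True)
class PrioritizedISBN:
    priority: Priority
    isbn: str = field(compare=False)  # only priority drives ordering

q = queue.PriorityQueue()
q.put(PrioritizedISBN(Priority.LOW, "1111111111"))
q.put(PrioritizedISBN(Priority.HIGH, "059035342X"))
```

Reading `Priority.HIGH` at the call site is self-explanatory in a way that a bare `priority=0` is not, which is the clarity argument made above.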
Q: Let's say a patron …
a) Does it add a second entry into the queue with a different priority but the same …?
b) We should make sure that some version of the isbn does get prioritized in the event that we de-duplicate isbns before they go into the queue. <-- this is the more significant case, i.e. if we prioritize, the isbn should go to the front of the line and block.
What happens if /api/books requests multiple isbns? I think it's not a problem, because our digitization center won't be doing this, and that is likely the only thing that will use it.
LGTM; (1) let's change the `/isbn` endpoint to not prioritize by default, and thus `from_isbn` will default to `prioritize=False`; (2) at minimum, add `stats.incr` to track failures of the affiliate server priority case.
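Suggestion (2) might look something like the sketch below; the `Stats` class and the counter key are stand-ins for OpenLibrary's real stats client, which this sketch does not reproduce.

```python
# Illustrative only: count cases where a high-priority affiliate
# server lookup comes back empty (e.g. a timeout).

class Stats:
    def __init__(self):
        self.counters = {}

    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1

stats = Stats()

def fetch_with_priority(isbn, lookup):
    result = lookup(isbn)
    if result is None:
        # Hypothetical counter name, not an actual OL metric.
        stats.incr("ol.affiliate.priority_timeout")
    return result
```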
openlibrary/core/models.py (Outdated)
```python
'status': 'staged',
'data': cleaned_metadata,
}
batch = get_current_amazon_batch()
```
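For context, a staged `import_item` row implied by the diff above might be built like this; field names other than `status` and `data` are assumptions, not the actual schema.

```python
import json

def build_staged_import(isbn, cleaned_metadata, batch_id):
    """Sketch of a staged import_item record; 'batch_id' and the
    'amazon:<isbn>' identifier format are assumptions."""
    return {
        "batch_id": batch_id,
        "ia_id": f"amazon:{isbn}",  # assumed key format
        "status": "staged",
        "data": json.dumps(cleaned_metadata),
    }
```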
Instead of `get_current_amazon_batch`, let's assume there will be a single batch `amz` for Amazon, keyed by amazon id / isbn.
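The single-batch idea (and the `get_current_amazon_batch` race discussed later in this thread) can be illustrated with a get-or-create keyed on a `UNIQUE` column, here using SQLite; the table and column names are assumptions, not the actual OpenLibrary schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import_batch (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
)

def get_or_create_batch(name):
    # INSERT OR IGNORE makes concurrent callers converge on one row
    # instead of racing to create duplicate batches.
    conn.execute("INSERT OR IGNORE INTO import_batch (name) VALUES (?)", (name,))
    conn.commit()
    return conn.execute(
        "SELECT id FROM import_batch WHERE name = ?", (name,)
    ).fetchone()[0]
```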
I think this PR is ready to go; merge at your discretion after testing, and possibly follow up w/ @cdrini re: dbnet
* Re-enable `/isbn` and AMZ imports

  This PR reenables AMZ imports from `/isbn` and `/api/books.json`. See this comment for examples of how to use the endpoints and what to expect: internetarchive#8690 (comment)

  The logic upon visiting `/isbn/[some isbn]` is now:
  - attempt to fetch the book from the OL database;
  - attempt to fetch the book from the `import_item` table (likely ISBNdb);
  - attempt to fetch the metadata from the Amazon Products API, clean that metadata for import, add it as a `staged` import in the `import_item` table, and then immediately import it with `load()`, by way of `ImportItem.import_first_staged()`.

  If any of these find or create an edition, the edition is returned.

* Stop bulk imports from AMZ records

  Import AMZ records as `staged`. See internetarchive#8541

* Modify the affiliate server to accept a GET parameter, `high_priority`, at `/isbn`. E.g., `http://localhost:31337/isbn/059035342X?high_priority=true`. `high_priority=true` will put the ISBN straight to the front of the queue for an AMZ Products API lookup, and attempt for three seconds to fetch the cached AMZ data (if it becomes available), returning that data, marshalled into a form suitable for creating an Edition, if possible.

* Use `high_priority=false` (the default) on the affiliate server to fetch AMZ data if available, then queue for import and immediately import the item, returning the resulting `Edition`, or `None`.

* Feature: `/api/books` will attempt to import from ISBN
  - adds a `high_priority` parameter to `/api/books` to make the API try to import books from ISBN.
  - relies on changes to `Edition.from_isbn()` which attempt to import editions first from `staged` items in the `import_item` table, and second from Amazon via the affiliate server.

---------

Co-authored-by: Mek <michael.karpeles@gmail.com>
Closes #8541
Fix.

This PR reenables AMZ imports from `/isbn`.

With this PR, the logic upon visiting `/isbn/[some isbn]` is:
- attempt to fetch the book from the OL database;
- attempt to fetch the book from the `import_item` table (likely ISBNdb);
- attempt to fetch the metadata from the Amazon Products API, clean that metadata for import, add it as a `staged` import in the `import_item` table, and then immediately import it with `load()`, by way of `ImportItem.import_first_staged()`.

Technical
This leverages the change in bec372c to (hopefully) prevent the race condition that seemed to cause multiple Amazon imports in #6405.
I also want to draw attention to the fact this does not add the proposed `GET` param, as `_get_amazon_metadata()` already has what amounts to a `while` loop that retries. As it stands, this PR retries 3 times with a 1 second sleep. I know the issue I created mentioned 5 retries, but my notes said 3. I have no opinion on which is 'better'.

Testing
Visit `/isbn/[isbn here]` with an ISBN only in Amazon data (i.e. not in OL already, and not in ISBNdb data that has been staged for import).

I attempted to test the race condition using the same method of concurrent requests that formerly added 4-5 items. Now it imports only one.

However, I did notice there is a race condition in `affiliate_server.get_current_amazon_batch()`, but fixing it is I think out of scope of this issue, especially as it probably requires a change to the `import_item` schema to add a `UNIQUE` constraint (if it is desired these row names always be unique).

Having said that, after 10 concurrent requests to the `/isbn` endpoint, here's the `import_item` table:

And here's `import_batch`, showing the potential for a race:

Stakeholders
@mekarpeles
@tfmorris