Re-enable Amazon imports from /isbn #8690

Merged
merged 14 commits into internetarchive:master from fix-amazon-api
Feb 13, 2024

Conversation

scottbarnes
Collaborator

@scottbarnes scottbarnes commented Jan 5, 2024

Closes #8541

Fix.

This PR reenables AMZ imports from /isbn.

With this PR, the logic upon visiting /isbn/[some isbn] is:

  • (existing functionality) attempt to fetch the book from the OL database;
  • (existing functionality) attempt to fetch the book from the import_item table (likely ISBNdb);
  • (new) attempt to fetch the metadata from the Amazon Products API, clean that metadata for import, add the record as a staged import in the import_item table, and then immediately import it with load(), by way of ImportItem.import_first_staged().

Technical

This leverages the change in bec372c to (hopefully) prevent the race condition that seemed to cause multiple Amazon imports in #6405.

I also want to draw attention to the fact this does not add the proposed GET param, as _get_amazon_metadata() already has what amounts to a while loop that retries. As it stands, this PR retries 3 times with a 1 second sleep. I know the issue I created mentioned 5 retries, but my notes said 3. I have no opinion on which is 'better'.
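The retry behaviour described above (three attempts, one second apart) can be sketched roughly as follows. This is only an illustration: `fetch_with_retries` and its parameters are hypothetical names, not the actual body of `_get_amazon_metadata()`, and whether the real code sleeps after the final attempt is not something this sketch claims to know.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], Optional[dict]],
    retries: int = 3,
    sleep_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() up to `retries` times, pausing between failed attempts.

    `fetch` is a hypothetical callable that returns metadata, or None when
    nothing is available yet. `sleep` is injectable so tests need not wait.
    """
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        # Skip the pause after the last attempt; there is nothing left to wait for.
        if attempt < retries - 1:
            sleep(sleep_seconds)
    return None
```

Making `retries` a parameter keeps the 3-versus-5 question a one-line change.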

Testing

Visit /isbn/[isbn here] with an ISBN only in Amazon data (i.e. not in OL already, and not in ISBNdb data that has been staged for import).

I attempted to test the race condition using the same method of concurrent requests that formerly added 4-5 items. Now it imports only one.

However, I did notice there is a race condition in affiliate_server.get_current_amazon_batch(), but fixing it is, I think, out of scope for this issue, especially as it probably requires a change to the import_item schema to add a UNIQUE constraint (if it is desired that these row names always be unique).

Having said that, after 10 concurrent requests to the /isbn endpoint, here's the import_item table:

openlibrary=# SELECT * from import_item;
-[ RECORD 1 ]---------------------------
id          | 39
batch_id    | 12
added_time  | 2024-01-05 05:38:49.103233
import_time | 2024-01-05 05:38:49.76983
status      | created
error       | 
ia_id       | amazon:059035342X
data        | 
ol_key      | /books/OL29M
comments    | 

And here's import_batch, showing the potential for a race:

openlibrary=# SELECT * from import_batch;
 id |    name    | submitter |        submit_time         
----+------------+-----------+----------------------------
 12 | amz-202401 |           | 2024-01-05 05:38:49.101042
 13 | amz-202401 |           | 2024-01-05 05:38:49.101614
 14 | amz-202401 |           | 2024-01-05 05:38:49.101806
 15 | amz-202401 |           | 2024-01-05 05:38:49.102198
(4 rows)
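As a rough illustration of how a UNIQUE constraint could remove the duplicate `amz-202401` rows shown above, here is a minimal get-or-create sketch. It uses SQLite from the standard library as a stand-in for PostgreSQL, the table is reduced to two columns, and `get_or_create_batch` is a hypothetical helper, not the existing `get_current_amazon_batch()`. The loop is sequential, so it only demonstrates the idempotent insert, not real concurrency.

```python
import sqlite3


def get_or_create_batch(conn: sqlite3.Connection, name: str) -> int:
    """Return the id of the batch called `name`, creating it at most once."""
    # With a UNIQUE constraint on `name`, INSERT OR IGNORE does nothing
    # when a row with this name already exists.
    conn.execute("INSERT OR IGNORE INTO import_batch (name) VALUES (?)", (name,))
    conn.commit()
    row = conn.execute(
        "SELECT id FROM import_batch WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


# Simplified, assumed schema: only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import_batch (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)

# Ten requests for the same batch name resolve to a single row.
ids = {get_or_create_batch(conn, "amz-202401") for _ in range(10)}
row_count = conn.execute("SELECT COUNT(*) FROM import_batch").fetchone()[0]
```

In PostgreSQL the comparable statement would be `INSERT ... ON CONFLICT DO NOTHING`, which likewise depends on the constraint being present.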

Stakeholders

@mekarpeles
@tfmorris

@scottbarnes scottbarnes marked this pull request as draft January 5, 2024 06:37
@scottbarnes scottbarnes force-pushed the fix-amazon-api branch 2 times, most recently from af040fd to f789ed7 Compare January 5, 2024 17:55
@scottbarnes scottbarnes marked this pull request as ready for review January 5, 2024 18:17
@tfmorris
Contributor

tfmorris commented Jan 5, 2024

I know this is still a draft, but I just wanted to comment on this:

fetch the metadata from the Amazon Products API, clean that metadata for import

I agree that any metadata from Amazon needs to be cleaned / filtered, but it seems like a tall task. For example, all these "author" records were created from Amazon for a single book:

OL9952815A Gulsen Heper, Steven D. Levitt, Stephen J. Dubner, Iclal Buyukdevrim Ozcelik
OL9956916A Stephen J. Dubner e Steven D. Levitt
OL10352437A Levitt Stephen D Dubner Steven J
OL10355709A STEVEN D. LEVITT - STEPHEN J. DUBNER
OL10830910A Levitt Steven D Dubner Stephen J
OL10832904A Levitt, Steven D.; Dubner, Stephen J.
OL11491838A Levitt, Steven D., Dubner, Stephen J.
OL11521672A Stephen Dubner Steven Levitt
OL12269988A Dubner, Stephen J., Levitt, Steven D.
OL12646009A Stephen J. Dubner Steven D. Levitt

OpenLibrary already has the book and the authors, so I don't see what value there is in attempting to untangle the Amazon mess (and I'm not even sure it's possible). Note that some of the names in the author string are translators and that it can also include other superfluous text like "editor" or "trans."

Creating these junk records just creates more work for the librarians.

@scottbarnes
Collaborator Author

scottbarnes commented Jan 5, 2024

I think the clean_amazon_metadata_for_load() function here may be a bit of a misnomer, as it is instead taking the response from the Amazon Products API and putting that data into a format where it can be imported.

That said, this will do nothing to address your point that the matching algorithm here seems to be failing miserably once the Amazon (or BWB, or wherever) data is put into the proper format for import. :(

Here, for example, I see that the data that comes back from Amazon for this book (in the unit test) lists "Mary GrandPré" as an author of Harry Potter, rather than as the illustrator (for the US editions). I also see the API is having trouble with the é in GrandPré and instead renders it as _.

I wouldn't call it a bright side to this PR, but at the very least it does stop automatic imports from Amazon. But if someone or something uses the /isbn endpoint and the ISBN isn't found in the OL database, it may very well ultimately import data from Amazon.

@mekarpeles mekarpeles self-assigned this Jan 8, 2024
@scottbarnes scottbarnes added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Jan 18, 2024
This PR reenables AMZ imports from `/isbn`.

The logic upon visiting `/isbn/[some isbn]` is now:
- attempt to fetch the book from the OL database;
- attempt to fetch the book from the `import_item` table (likely ISBNdb);
- attempt to fetch the metadata from the Amazon Products API, clean that
  metadata for import, add the record as a `staged` import in the `import_item`
  table, and then immediately import it with `load()`, by way of
  `ImportItem.import_first_staged()`.

If any of these find or create an edition, the edition is returned.
Import AMZ records as `staged`.
See internetarchive#8541
`/isbn`. E.g., `http://localhost:31337/isbn/059035342X?priority=0`.

A priority of `0` will put the ISBN straight to the front of the queue
for an AMZ Products API look up, and attempt for three seconds to fetch
the cached AMZ data (if it becomes available), returning that data,
marshalled into a form suitable for creating an edition, if possible.
then queue for import and immediately import the item, returning the
resulting `Edition`, or `None`.
This commit:
- adds an `import_missing` parameter to `/api/books` to make the API try
  to import books from ISBN.
- relies on changes to `Edition.from_isbn()` which attempt to import
  editions first from `staged` items in the `import_item` table, and
  second from Amazon via the affiliate server.

This commit likely belongs in a separate PR, but for the sake of convenient
testing it is for the moment included in the `/isbn` re-enabling PR.
`queue.PriorityQueue` gives priority to whatever would be returned by
`min([queue_items])`. However, setting `Priority.priority=0` can look a
bit like priority is disabled. Using an `Enum` may help clarify how
priority works, as `Priority.HIGH` has a value of `0`, and is used when
sorting priority items in the queue.
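
A minimal sketch of the idea in this commit message, assuming a two-level `Priority` enum and a hypothetical `PrioritizedISBN` wrapper (the names and fields in the actual affiliate server may differ):

```python
import queue
from dataclasses import dataclass, field
from enum import Enum


class Priority(Enum):
    """Lower values are served first, matching queue.PriorityQueue semantics."""

    HIGH = 0
    LOW = 1

    def __lt__(self, other):
        # Enum members are not orderable by default; compare by value.
        if isinstance(other, Priority):
            return self.value < other.value
        return NotImplemented


@dataclass(order=True)
class PrioritizedISBN:
    """Hypothetical queue item: ordered by priority only, never by ISBN."""

    priority: Priority
    isbn: str = field(compare=False)


q: queue.PriorityQueue = queue.PriorityQueue()
q.put(PrioritizedISBN(Priority.LOW, "0312368615"))
q.put(PrioritizedISBN(Priority.HIGH, "059035342X"))

first = q.get()   # the HIGH item, although it was queued second
second = q.get()  # the LOW item
```

Writing `Priority.HIGH` at the call site avoids the impression that `priority=0` means "priority disabled".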
@mekarpeles
Member

Q: Let's say patron A adds isbn 123 to the queue under normal priority, what happens if patron B adds the same isbn 123 to the queue as high priority?

a) Does it add a second entry into the queue with different priority but same isbn? This may be disadvantageous (but perhaps not a problem) in terms of additional amz lookups, e.g. imagine the same isbn being passed twice to the same lookup.

b) We should make sure that some version of the isbn does get prioritized in the event that we do unique isbns before they go into the queue. <-- this is the more significant case, i.e. if we prioritize, the isbn should go to front of the line and block.

@mekarpeles
Member

What happens if /api/books requests multiple isbns -- and I think, it's not a problem, because our digitization center won't be doing this and that is likely the only thing that will use import_missing

Member

@mekarpeles mekarpeles left a comment

LGTM; (1) let's change the /isbn to not prioritize by default and thus from_isbn will prioritize=False by default (2) at minimum, add stats.incr to track failures of affiliate server priority case.

@scottbarnes
Collaborator Author

scottbarnes commented Feb 1, 2024

Accessing /isbn on the affiliate-server directly

Low priority (the default case): http://localhost:31337/isbn/059035342X?high_priority=false

This will queue an item for lookup on the Amazon Products API and insert returned data (if any) as staged in import_item.

Browser reply: {"status": "submitted", "queue": 1}
Database:

openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]------
id          | 83
batch_id    | 17
added_time  | 2024-02-01 03:13:57.228587
import_time | 
status      | staged
error       | 
ia_id       | amazon:059035342X
data        | {"authors": [{"name": "Rowling, J.K."}, {"name": "GrandPr_, Mary"}], "cover": "https://m.media-amazon.com/images/I/51fLJOHOJFL._SL500_.jpg", "isbn_10": ["059035342X"], "isbn_13": ["9780590353427"], "number_of_pages": 309, "physical_format": "paperback", "publish_date": "Sep 18, 1998", "publishers": ["Scholastic"], "source_records": ["amazon:059035342X"], "title": "Harry Potter and the Sorcerer's Stone"}
ol_key      | 
comments    | 

High priority (block + wait): http://localhost:31337/isbn/059035342X?high_priority=true

The affiliate-server will block+wait and return the data directly to the requesting client; additionally, data is staged for import in import_item.

Browser reply:

{"status": "success", "hit": {"url": "https://www.amazon.com/dp/059035342X/?tag=internetarchi-20", "source_records": ["amazon:059035342X"], "isbn_10": ["059035342X"], "isbn_13": ["9780590353427"], "price": "$5.97", "price_amt": 597, "title": "Harry Potter and the Sorcerer's Stone", "cover": "https://m.media-amazon.com/images/I/51fLJOHOJFL._SL500_.jpg", "authors": [{"name": "Rowling, J.K."}, {"name": "GrandPr_, Mary"}], "publishers": ["Scholastic"], "number_of_pages": 309, "edition_num": "1", "publish_date": "Sep 18, 1998", "product_group": "Book", "physical_format": "paperback"}}

Note: this also inserts a new staged item in import_item prior to returning (i.e. there's a guarantee that by the time the client gets a response, any data has been staged in import_item).

openlibrary=# SELECT * FROM import_item; 
-[ RECORD 1 ]------
id          | 84
batch_id    | 17
added_time  | 2024-02-01 03:18:35.413271
import_time | 
status      | staged
error       | 
ia_id       | amazon:059035342X
data        | {"authors": [{"name": "Rowling, J.K."}, {"name": "GrandPr_, Mary"}], "cover": "https://m.media-amazon.com/images/I/51fLJOHOJFL._SL500_.jpg", "isbn_10": ["059035342X"], "isbn_13": ["9780590353427"], "number_of_pages": 309, "physical_format": "paperback", "publish_date": "Sep 18, 1998", "publishers": ["Scholastic"], "source_records": ["amazon:059035342X"], "title": "Harry Potter and the Sorcerer's Stone"}
ol_key      | 
comments    |
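
For a client of the affiliate server, the two reply shapes shown above ("submitted" for low priority, "success" with a "hit" for high priority) could be told apart along these lines. `extract_hit` is a hypothetical helper written for illustration, and the sample payloads are abbreviated copies of the replies above.

```python
import json
from typing import Optional


def extract_hit(reply_body: str) -> Optional[dict]:
    """Return the metadata from a high-priority reply, or None otherwise.

    A low-priority request only reports that the ISBN was queued, so there
    is no "hit" to return in that case.
    """
    reply = json.loads(reply_body)
    if reply.get("status") == "success":
        return reply.get("hit")
    return None


# Abbreviated versions of the two replies documented above.
submitted = '{"status": "submitted", "queue": 1}'
success = (
    '{"status": "success", "hit": '
    '{"isbn_10": ["059035342X"], "isbn_13": ["9780590353427"]}}'
)

queued_result = extract_hit(submitted)
hit = extract_hit(success)
```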

ISBN endpoint at openlibrary.org/isbn

The resolution process is:

  1. Check Open Library's database using the ISBN to find a match;
  2. Check staged items in the import_item table with the ISBN; and finally
  3. Use the affiliate server to either (1) block+wait in the high priority context, using the Amazon Products API to look up and return item data, or (2) queue an ISBN for look up and add any response to import_item as staged.
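
The three steps above amount to a fallback chain. A minimal sketch, assuming each step is passed in as a callable that returns an edition key or None; `resolve_isbn` and its parameter names are hypothetical and do not mirror the real `Edition.from_isbn()` signature:

```python
from typing import Callable, Optional


def resolve_isbn(
    isbn: str,
    find_in_db: Callable[[str], Optional[str]],
    import_staged: Callable[[str], Optional[str]],
    ask_affiliate_server: Callable[[str, bool], Optional[str]],
    high_priority: bool = False,
) -> Optional[str]:
    """Return an edition key such as "/books/OL69M", or None if unresolved."""
    # 1. An edition that already exists in Open Library wins outright.
    key = find_in_db(isbn)
    if key is not None:
        return key

    # 2. Otherwise import a matching staged item from import_item, if any.
    key = import_staged(isbn)
    if key is not None:
        return key

    # 3. Finally ask the affiliate server. In the low-priority case this is
    #    assumed to queue the lookup and return None straight away, which is
    #    why the first visit shows a 404 while a later visit finds staged data.
    return ask_affiliate_server(isbn, high_priority)
```

Because each step short-circuits, the affiliate server is contacted only when both local lookups fail.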

Low priority (default case): http://localhost:8080/isbn/059035342X?high_priority=false

If the resolution process gets to the final step, here the affiliate server returns immediately, queues the ISBN for look up, and adds any response to import_item as staged.

In browser:

404 - Page Not Found

/isbn/059035342X does not exist.

In database an item is staged:

openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]----
id          | 85
batch_id    | 17
added_time  | 2024-02-01 03:20:11.750414
import_time | 
status      | staged
error       | 
ia_id       | amazon:059035342X
data        | {"authors": [{"name": "Rowling, J.K."}, {"name": "GrandPr_, Mary"}], "cover": "https://m.media-amazon.com/images/I/51fLJOHOJFL._SL500_.jpg", "isbn_10": ["059035342X"], "isbn_13": ["9780590353427"], "number_of_pages": 309, "physical_format": "paperback", "publish_date": "Sep 18, 1998", "publishers": ["Scholastic"], "source_records": ["amazon:059035342X"], "title": "Harry Potter and the Sorcerer's Stone"}
ol_key      | 
comments    |

Subsequent visit to /isbn for an already-queued low priority item

Here the item is imported directly from the already-staged data in import_item.

Visit http://localhost:8080/isbn/059035342X?high_priority=false ->:

  • the browser displays the page of a newly created edition;
  • the previously staged item has been imported:
openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]---------------------------
id          | 85
batch_id    | 17
added_time  | 2024-02-01 03:20:11.750414
import_time | 2024-02-01 03:24:04.964664
status      | created
error       | 
ia_id       | amazon:059035342X
data        | 
ol_key      | /books/OL68M
comments    |

High priority: http://localhost:8080/isbn/059035342X?high_priority=true

Here the affiliate-server blocks+waits.

Visit http://localhost:8080/isbn/059035342X?high_priority=true ->

  • the API blocks + waits and the browser displays the page of a newly created edition.
  • an item has been staged and imported:
openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]---------------------------
id          | 86
batch_id    | 17
added_time  | 2024-02-01 03:25:55.845604
import_time | 2024-02-01 03:25:56.135687
status      | created
error       | 
ia_id       | amazon:059035342X
data        | 
ol_key      | /books/OL69M
comments    | 

import via /api/books

The flow is much the same. If 'high_priority=true', then for any ISBN bib_keys that were not found in the Open Library database, Edition.from_isbn(high_priority=True) is called and the affiliate-server will block + wait until the Amazon Products API responds, and then dynlinks() will use this data to immediately import and use the editions.

If 'high_priority=false', then dynlinks() returns whatever it finds in the Open Library database, and also queues a look up via the Amazon Products API, and whatever is found is staged in import_item but no further action is taken.
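
From the caller's side the two modes differ only in the `high_priority` query parameter. A small sketch of building such a request URL, using the `bibkeys` and `high_priority` parameters that appear in the example URLs in this comment; `books_api_url` itself is a hypothetical helper:

```python
from urllib.parse import urlencode


def books_api_url(base_url: str, bib_keys: list, high_priority: bool = False) -> str:
    """Build an /api/books.json URL for the given ISBN bib keys."""
    query = urlencode(
        {
            "bibkeys": ",".join(bib_keys),
            # The endpoint examples use lowercase true/false.
            "high_priority": str(high_priority).lower(),
        },
        safe=",",  # keep the comma separator readable instead of %2C
    )
    return f"{base_url}/api/books.json?{query}"


url = books_api_url("http://localhost:8080", ["059035342X", "0312368615"])
```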

high_priority=false (the default case) with no matches in Open Library and no staged matches.

Here, neither ISBN has a match in Open Library, so nothing is returned, but both items are queued for lookup via the Amazon Products API and ultimately staged.

Visit http://localhost:8080/api/books.json?bibkeys=059035342X,0312368615&high_priority=false

  • The browser displays the following JSON: {}
  • The AMZ Product API data associated with the two ISBNs are staged:
openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]----
id          | 90
batch_id    | 17
added_time  | 2024-02-01 23:20:37.955542
import_time | 
status      | staged
error       | 
ia_id       | amazon:059035342X
data        | {"authors": [{"name": "Rowling, J.K."}, {"name": "GrandPr_, Mary"}], "cover": "https://m.media-amazon.com/images/I/51fLJOHOJFL._SL500_.jpg", "isbn_10": ["059035342X"], "isbn_13": ["9780590353427"], "number_of_pages": 309, "physical_format": "paperback", "publish_date": "Sep 18, 1998", "publishers": ["Scholastic"], "source_records": ["amazon:059035342X"], "title": "Harry Potter and the Sorcerer's Stone"}
ol_key      | 
comments    | 
-[ RECORD 2 ]----
id          | 91
batch_id    | 17
added_time  | 2024-02-01 23:20:37.955542
import_time | 
status      | staged
error       | 
ia_id       | amazon:0312368615
data        | {"authors": [{"name": "L'Engle, Madeleine"}], "cover": "https://m.media-amazon.com/images/I/41lxKYIsCHL._SL500_.jpg", "isbn_10": ["0312368615"], "isbn_13": ["9780312368616"], "notes": "Source title: Many Waters (A Wrinkle in Time Quintet)", "number_of_pages": 320, "physical_format": "paperback", "publish_date": "May 01, 2007", "publishers": ["Square Fish"], "source_records": ["amazon:0312368615"], "title": "Many Waters"}
ol_key      | 
comments    |

high_priority=false (the default case) with no matches in Open Library, but WITH staged matches.

Here, import_item has ISBN matches for the ISBN bib_keys in the query. Even though high_priority=false, new editions are created and imported for immediate use:

Visit http://localhost:8080/api/books.json?bibkeys=059035342X,0312368615&high_priority=false

  • The browser displays the following JSON:
{
  "059035342X": {
    "bib_key": "059035342X",
    "info_url": "http://localhost:8080/books/OL73M/Harry_Potter_and_the_Sorcerer's_Stone",
    "preview": "noview",
    "preview_url": "http://localhost:8080/books/OL73M/Harry_Potter_and_the_Sorcerer's_Stone",
    "thumbnail_url": "https://covers.openlibrary.org/b/id/57-S.jpg"
  },
  "0312368615": {
    "bib_key": "0312368615",
    "info_url": "http://localhost:8080/books/OL74M/Many_Waters",
    "preview": "noview",
    "preview_url": "http://localhost:8080/books/OL74M/Many_Waters",
    "thumbnail_url": "https://covers.openlibrary.org/b/id/58-S.jpg"
  }
}

The two corresponding staged items in import_item have been updated to created:

openlibrary=# SELECT * FROM import_item;
-[ RECORD 1 ]---------------------------
id          | 90
batch_id    | 17
added_time  | 2024-02-01 23:20:37.955542
import_time | 2024-02-01 23:33:24.648476
status      | created
error       | 
ia_id       | amazon:059035342X
data        | 
ol_key      | /books/OL73M
comments    | 
-[ RECORD 2 ]---------------------------
id          | 91
batch_id    | 17
added_time  | 2024-02-01 23:20:37.955542
import_time | 2024-02-01 23:33:24.942446
status      | created
error       | 
ia_id       | amazon:0312368615
data        | 
ol_key      | /books/OL74M
comments    |

Using high_priority=true immediately imports with block + wait

This blocks + waits on the AMZ affiliate-server side and returns any matches for import in dynlinks() (using Edition.from_isbn(high_priority=True))

Visit: http://localhost:8080/api/books.json?bibkeys=059035342X,0312368615&high_priority=true
The browser displays:

{
  "059035342X": {
    "bib_key": "059035342X",
    "info_url": "http://localhost:8080/books/OL81M/Harry_Potter_and_the_Sorcerer's_Stone",
    "preview": "noview",
    "preview_url": "http://localhost:8080/books/OL81M/Harry_Potter_and_the_Sorcerer's_Stone",
    "thumbnail_url": "https://covers.openlibrary.org/b/id/65-S.jpg"
  },
  "0312368615": {
    "bib_key": "0312368615",
    "info_url": "http://localhost:8080/books/OL82M/Many_Waters",
    "preview": "noview",
    "preview_url": "http://localhost:8080/books/OL82M/Many_Waters",
    "thumbnail_url": "https://covers.openlibrary.org/b/id/66-S.jpg"
  }
}

The database shows:

-[ RECORD 1 ]---------------------------
id          | 102
batch_id    | 17
added_time  | 2024-02-02 00:12:50.39934
import_time | 2024-02-02 00:12:50.69445
status      | created
error       | 
ia_id       | amazon:059035342X
data        | 
ol_key      | /books/OL81M
comments    | 
-[ RECORD 2 ]---------------------------
id          | 103
batch_id    | 17
added_time  | 2024-02-02 00:12:50.713998
import_time | 2024-02-02 00:12:51.059623
status      | created
error       | 
ia_id       | amazon:0312368615
data        | 
ol_key      | /books/OL82M
comments    |

Query: is this networks change correct in compose.production.yaml?

@scottbarnes scottbarnes removed the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Feb 4, 2024
'status': 'staged',
'data': cleaned_metadata,
}
batch = get_current_amazon_batch()
Member

@mekarpeles mekarpeles Feb 7, 2024

Instead of get_current_amazon_batch, let's assume there will be a single batch amz for amazon, keyed by amazon id / isbn.

compose.production.yaml (outdated review thread, resolved)
Member

@mekarpeles mekarpeles left a comment

I think this PR is ready to go, merge at your discretion after testing and possibly follow up w/ @cdrini re: dbnet

@scottbarnes scottbarnes merged commit 2f59017 into internetarchive:master Feb 13, 2024
3 checks passed
Achorn pushed a commit to Achorn/openlibrary that referenced this pull request Mar 14, 2024
* Re-enable `/isbn` and AMZ imports

This PR reenables AMZ imports from `/isbn` and `/api/books.json`.

See this comment for examples of how to use the endpoints and what
to expect:
internetarchive#8690 (comment)

The logic upon visiting `/isbn/[some isbn]` is now:
- attempt to fetch the book from the OL database;
- attempt to fetch the book from the `import_item` table (likely ISBNdb);
- attempt to fetch the metadata from the Amazon Products API, clean that
  metadata for import, add the record as a `staged` import in the `import_item`
  table, and then immediately import it with `load()`, by way of
  `ImportItem.import_first_staged()`.

If any of these find or create an edition, the edition is returned.

* Stop bulk imports from AMZ records

Import AMZ records as `staged`.
See internetarchive#8541

* Modify the affiliate server to accept a GET parameter, `high_priority`, at
`/isbn`. E.g., `http://localhost:31337/isbn/059035342X?high_priority=true`.

`high_priority=true` will put the ISBN straight to the front of the queue
for an AMZ Products API look up, and attempt for three seconds to fetch
the cached AMZ data (if it becomes available), returning that data,
marshalled into a form suitable for creating an Edition, if possible.

* Use `high_priority=false` (the default) on the affiliate server to fetch AMZ
data if available, then queue for import and immediately import the item,
returning the resulting `Edition`, or `None`.

* Feature: `/api/books` will attempt to import from ISBN
- adds a `high_priority` parameter to `/api/books` to make the API try
  to import books from ISBN.
- relies on changes to `Edition.from_isbn()` which attempt to import
  editions first from `staged` items in the `import_item` table, and
  second from Amazon via the affiliate server.

---------

Co-authored-by: Mek <michael.karpeles@gmail.com>
Development

Successfully merging this pull request may close these issues.

Move AMZ retry code from Edition.from_isbn() to affiliate-server
3 participants