
imports: add importer for ISBNdb #8511

Merged: 13 commits merged into internetarchive:master from bulk-import-isbndb, Nov 18, 2023

Conversation

@scottbarnes (Collaborator) commented Nov 9, 2023

Partially closes #7658

Feature.

Technical

This adds scripts/providers/isbndb.py, which is adapted from scripts/partner_batch_imports.py. It should include:

  • valid data mapping from an ISBNdb record to an OL record (see the sketch after this list);
  • a log of where the script is in the list of ISBNdb records, written as import.log;
  • the ability to gracefully restart (shamelessly taken from partner_batch_imports.py); and
  • a status of staged in the import_item db.
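
For a sense of what that mapping involves, here is a minimal sketch (this is not the actual code in scripts/providers/isbndb.py; the ISBNdb field names follow the example records quoted later in this thread, and the publisher/date_published keys are assumptions):

```python
import json

# Hypothetical sketch of the ISBNdb -> Open Library mapping; the real
# implementation lives in scripts/providers/isbndb.py.
def map_isbndb_to_ol(isbndb: dict) -> dict:
    isbn_13 = isbndb["isbn13"]
    rec = {
        "title": isbndb.get("title"),
        "isbn_13": [isbn_13],
        "authors": [{"name": name} for name in isbndb.get("authors", [])],
        "publishers": [isbndb["publisher"]] if isbndb.get("publisher") else None,
        "publish_date": str(isbndb["date_published"]) if isbndb.get("date_published") else None,
        # This becomes the "idb:<isbn>" source record seen in the import_item rows below.
        "source_records": [f"idb:{isbn_13}"],
    }
    # Drop empty fields so downstream validation sees only real data.
    return {k: v for k, v in rec.items() if v}


# Each line of the JSONL dump is one ISBNdb record.
with open("isbndb_batches/isbndb.jsonl") as f:
    for line in f:
        print(map_isbndb_to_ol(json.loads(line)))
```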

Issues / room for improvement

This implementation has an impermissible amount of copy/pasted code from partner_batch_imports.py, owing to the desire to have this working quickly and properly. Both files need refactoring to DRY this up. This may also require changes to cron.

Possible bug

A known bug, which I think is also present in partner_batch_imports.py: import.log is not updated via update_state() when a particular item can't be imported, so upon resume the script will retry that item, print an error via logger.info(), and continue.
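
For reference, the state-tracking pattern inherited from partner_batch_imports.py is roughly the following (a sketch with an assumed file format, shown only to locate the gap: update_state() runs on success, so a failed item is retried on every resume):

```python
STATE_FILE = "import.log"  # assumed name/format, for illustration only


def update_state(fname: str, line_num: int) -> None:
    """Record progress through the dump so the script can restart gracefully."""
    with open(STATE_FILE, "w") as f:
        f.write(f"{fname}\t{line_num}\n")


def load_state() -> tuple[str, int]:
    """Return the (filename, line number) to resume from, if any."""
    try:
        with open(STATE_FILE) as f:
            fname, line_num = f.readline().strip().split("\t")
        return fname, int(line_num)
    except FileNotFoundError:
        return "", 0
```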

Steps for importing from ISBNdb JSONL dump

If using locally, with a theoretical directory named 'isbndb_batches' in the openlibrary root that holds an ISBNdb dump named isbndb.jsonl (anything with an isbndb prefix should work), you'd run:

```
docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches
# Optional, for promise items from https://archive.org/details/bwb_daily_pallets_2023-11-11
docker compose exec -e PYTHONPATH="." web python ./scripts/promise_batch_imports.py config/openlibrary.yml bwb_daily_pallets_2023-11-11
docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all
```

The first command loads the database with staged ISBNdb items; the second (optional) imports promise items and marks relevant ISBNs from the ISBNdb dump as pending; and the third starts the import script to process them.

Note: in a local environment you may run into Docker permissions issues; a quick (local-only) fix is `chmod -R 777 isbndb_batches`, though be aware of the global `rwx` access this grants.

Steps for using ISBNdb as a backing store for /isbn

See above, then do the following step and its prerequisites:

```
docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches
```

Then visit /isbn/{some_isbn_here_from_the_isbndb_dump}; it should import, and the edition's history should note the source as ISBNdb.
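
Conceptually, /isbn now has a staged backing store to fall back on. A loose sketch of that flow (every helper name here is invented for illustration; the real logic lives in Edition.from_isbn and the import pipeline):

```python
# Hypothetical sketch of just-in-time importing at /isbn.
def edition_for_isbn(isbn_13: str):
    edition = find_existing_edition(isbn_13)  # normal catalog lookup (assumed helper)
    if edition:
        return edition
    staged = find_import_item(ia_id=f"idb:{isbn_13}", status="staged")  # assumed helper
    if staged:
        # Importing the staged record creates the edition; its source_records
        # entry ("idb:...") is what makes the history note ISBNdb as the source.
        return import_staged_record(staged)  # assumed helper
    return None  # fall through to the normal not-found behavior
```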

How the records look, using a promise item flow (ignore the unfortunate fact that this is a CD-ROM)

After initial import, the item is staged:

```
 8428 |        1 | 2023-11-13 23:10:09.830838 |             | staged |       | idb:9781857580143 | {"authors": [{"name": "Moore, Stephen"}], "isbn_13": ["9781857580143"], "languages": ["eng"], "number_of_pages": 220, "publish_date": "1992", "publishers": ["Letts Educational"], "source_records": ["idb:9781857580143"], "title": "Revise Sociology (GCSE CD-ROM Revision Guides)"}
```

Next, after promise_batch_imports.py the item is marked as pending:

```
openlibrary=# SELECT * FROM import_item WHERE id = 8428;
  id  | batch_id |         added_time         | import_time | status  | error |       ia_id       | data | ol_key | comments
------+----------+----------------------------+-------------+---------+-------+-------------------+------+--------+----------
 8428 |        1 | 2023-11-13 23:10:09.830838 |             | pending |       | idb:9781857580143 | {"authors": [{"name": "Moore, Stephen"}], "isbn_13": ["9781857580143"], "languages": ["eng"], "number_of_pages": 220, "publish_date": "1992", "publishers": ["Letts Educational"], "source_records": ["idb:9781857580143"], "title": "Revise Sociology (GCSE CD-ROM Revision Guides)"} |        |
```
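
The status flip shown above presumably amounts to something like this (a hypothetical sketch, not the actual code in scripts/promise_batch_imports.py; the db helper usage is an assumption):

```python
from openlibrary.core import db  # assumed import path


def mark_staged_isbndb_item_pending(isbn_13: str) -> None:
    """Flip a staged ISBNdb import_item to pending so import-all picks it up."""
    db.get_db().query(
        "UPDATE import_item SET status = 'pending' "
        "WHERE status = 'staged' AND ia_id = $ia_id",
        vars={"ia_id": f"idb:{isbn_13}"},
    )
```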

Finally, once `docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all` is run, the item's status becomes modified:

```
openlibrary=# SELECT * FROM import_item WHERE id = 8428;
  id  | batch_id |         added_time         |        import_time         |  status  | error |       ia_id       | data |     ol_key     | comments
------+----------+----------------------------+----------------------------+----------+-------+-------------------+------+----------------+----------
 8428 |        1 | 2023-11-13 23:10:09.830838 | 2023-11-13 23:29:03.733418 | modified |       | idb:9781857580143 |      | /books/OL3888M |
```

Testing

There are minimal unit tests. I can include more output if it is useful.

The above database query shows the results, with the status being staged, and the data being unmarshalled into a format that is suitable for OL import.

  • If we DRY up the parse_data and /api/import path in plugins.importapi.ImportAPI, then let's make sure that /api/imports still works.
  • Also that /isbn works (does not show a generic stack trace / 12093810923.html error page) with a record missing obvious data (e.g. no ISBN, title, authors, etc.); see the smoke-check sketch below.
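
For the /isbn bullet, a crude smoke check along these lines would catch the stack-trace case (a sketch only; it assumes a local dev instance on localhost:8080 and an ISBN absent from both OL and the staged data):

```python
import requests

# Hypothetical smoke check: /isbn for a sparse or unknown record should render
# a real page (e.g. a 404), not a generic stack-trace error page.
resp = requests.get("http://localhost:8080/isbn/9780000000000")
assert resp.status_code in (200, 404), resp.status_code
assert "Traceback" not in resp.text
```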

Stakeholders

@mekarpeles

@mekarpeles mekarpeles self-assigned this Nov 13, 2023
@mekarpeles mekarpeles added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Nov 13, 2023
```diff
@@ -12,9 +12,9 @@
 $ record = get_source_record(record_id)
 $if v.revision == 1:
 $ record_type = ''
-$if record.source_name not in ('amazon.com', 'Better World Books', 'Promise Item'):
+$if record.source_name not in ('amazon.com', 'Better World Books', 'Promise Item', 'ISBNdb'):
```
Member: We still have bug #2643

Change `ImportItem.find_pending()` so that it returns a `map` if and only
if the `map` is not empty, and otherwise return `None`.

Without this, `manage_imports.import_all` doesn't sleep, which:
1. Causes text to scroll by faster than it can possibly be read, and
2. Causes load averages to spike, and consumes the entire CPU of one
   core (in my local dev environment, anyway).

Currently, the conditional for when to sleep for 60 seconds after
checking for a batch and finding nothing is never true, because
`find_pending()` returns a `map`, which is always truthy:
```
>>> m = map(len, [])
>>> bool(m)
True
>>> next(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
```
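
The fix is essentially to materialize the rows before testing for emptiness (a sketch of the shape of the change; the actual query plumbing in ImportItem.find_pending is elided and the helper names are invented):

```python
def find_pending(limit: int = 1000):
    # Materialize first, and return None when there is nothing to do,
    # so callers can sleep on a falsy result.
    rows = list(query_pending_import_items(limit))  # assumed query helper
    return map(augment_import_item, rows) if rows else None  # assumed mapper
```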
@scottbarnes scottbarnes force-pushed the bulk-import-isbndb branch 2 times, most recently from 3a50e7e to 0499ad9 Compare November 15, 2023 15:23
If using locally, with a theoretical directory named 'isbndb_batches'
that holds an ISBNdb dump named `isbndb.jsonl`, you'd run:
```
docker compose exec -e PYTHONPATH="." web python ./scripts/providers/isbndb.py config/openlibrary.yml isbndb_batches
docker compose exec -e PYTHONPATH="." web python ./scripts/manage_imports.py --config config/openlibrary.yml import-all
```

The first command would load the database with `staged` ISBNdb items,
and the second command would start the import script to process them.

Note: in a local environment you may run into Docker permissions issues; a
quick (local-only) fix is `chmod -R 777 isbndb_batches`, though be aware of
the global `rwx` access this grants.
…port source

Note: this does not yet reenable AMZ as a source for Edition.from_isbn,
which `/isbn` imports use.
@tfmorris tfmorris left a comment


I don't think we want to be adding additional low quality data sources when the users are still struggling to clean up the last enormous dump of poor quality data.

The garbage in the example data amply illustrates just how bad this is. GIGO

```
'msrp': '0.00',
'title': '確定申告、住宅ローン控除とは?',  # "What is tax filing and the home-loan deduction?"
'isbn13': '9780000002259',
'authors': ['田中 卓也 ~autofilled~'],  # "Tanaka Takuya ~autofilled~"
```
Contributor: Autofilled?

```
'msrp': '1.99',
'image': 'Https://images.isbndb.com/covers/01/01/9780000000101.jpg',
'pages': 8,
'title': 'Nga Aboriginal Art Cal 2000',
```
Contributor: An aboriginal art calendar isn't a book.

```
'pages': 8,
'title': 'Nga Aboriginal Art Cal 2000',
'isbn13': '9780000000101',
'authors': ['Nelson, Bob, Ph.D.'],
```
Contributor: Why is an inspirational speaker authoring an art calendar with the subject of "mushrooms"?

```
'edition': '1',
'language': 'en',
'subjects': ['PQ', '878'],
'synopsis': 'Francesco Petrarca.',
```
Contributor: The synopsis is the name of an Italian Renaissance humanist? (And the Japanese title references bridal gown trends.)

This script was renamed to make it easier to import from, but it turns
out that for now this is not necessary. Formerly the `do_import` function was
being imported into core.models.Edition.from_isbn(), but that method now
imports from `load()` directly, so the rename is unnecessary.
This functionality will be moved into the affiliate server.
```python
ol.autologin()
if os.getenv('LOCAL_DEV'):
    ol = OpenLibrary(base_url="http://localhost:8080")
    ol.login("admin", "admin123")
```
Member: @cdrini, @scottbarnes mentioned we may want to take these localhost creds, which are also used in copydocs, and make them environment variables in Docker rather than hard-coding them here (for another PR).
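
That might end up looking something like this (a sketch; the OL_LOGIN_* variable names are invented, and the client calls mirror the snippet above):

```python
import os

from olclient.openlibrary import OpenLibrary  # assumed client import path

# Hypothetical: pull local-dev credentials from the environment instead of
# hard-coding admin/admin123.
if os.getenv("LOCAL_DEV"):
    ol = OpenLibrary(base_url="http://localhost:8080")
    ol.login(os.environ["OL_LOGIN_USERNAME"], os.environ["OL_LOGIN_PASSWORD"])
else:
    ol = OpenLibrary()
    ol.autologin()
```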

@scottbarnes scottbarnes force-pushed the bulk-import-isbndb branch 4 times, most recently from 60530b3 to 25f6f51 Compare November 16, 2023 18:50
To validate data for parse_data(), we fill in some dummy data, and then
remove it as soon as parse_data() runs. But this means if anyone wants
to call load(), they need to call parse_data() to get the rec
appropriate for load(), then 'manually' remove the dummy data.

This commit moves the cleaning of the dummy data to catalog.addbook
inside the normalize_import_record() function, which load() calls.
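
In outline, the moved cleanup might look like this inside normalize_import_record() (a sketch; the actual dummy marker values used with parse_data() are assumptions here):

```python
def normalize_import_record(rec: dict) -> None:
    # Strip placeholder values that were injected only so that parse_data()
    # validation would pass; these marker values are invented for illustration.
    dummies = {"title": "__DUMMY_TITLE__", "publish_date": "????"}
    for key, marker in dummies.items():
        if rec.get(key) == marker:
            del rec[key]
```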
@scottbarnes scottbarnes force-pushed the bulk-import-isbndb branch 2 times, most recently from 52a4a5a to c2b0e0c Compare November 16, 2023 19:19
An alternative approach to reducing the business logic in
Edition.from_isbn, which also stops the race condition that can occur
when identical concurrent requests are made to /isbn for the same ISBN.
@scottbarnes scottbarnes added the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label Nov 18, 2023
@mekarpeles mekarpeles left a comment


Most recent changes lgtm

@mekarpeles mekarpeles merged commit 05eca7c into internetarchive:master Nov 18, 2023
3 checks passed
@scottbarnes scottbarnes deleted the bulk-import-isbndb branch November 18, 2023 20:58
@jimchamp jimchamp removed the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label Apr 2, 2024
Successfully merging this pull request may close these issues:

Stage ISBNdb Imports & Enable JIT Importing