Skip to content

Fix Amazon ASIN import failing to match existing editions#12360

Draft
Copilot wants to merge 6 commits intomasterfrom
copilot/fix-asins-matching-issue
Draft

Fix Amazon ASIN import failing to match existing editions#12360
Copilot wants to merge 6 commits intomasterfrom
copilot/fix-asins-matching-issue

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 13, 2026

  • Fix ASIN import matching bug (remove empty-pool early exit)
  • Add regression test test_load_matches_existing_edition_on_asin_when_pool_is_empty
  • Encapsulate build_pool inside find_match — remove leaky edition_pool parameter (information hiding)
  • Update test_find_match_title_only_promiseitem_against_noisbn_marc to call find_match(rec) directly
  • Validation: syntax check, CodeQL scan, code review — all pass

Copilot AI linked an issue Apr 13, 2026 that may be closed by this pull request
Copilot AI changed the title [WIP] Fix import failing to match on amazon ASIN Fix Amazon ASIN import failing to match existing editions Apr 13, 2026
Copilot AI requested a review from mekarpeles April 13, 2026 18:40
if not edition_pool:
# No match candidates found, add edition
return load_data(rec, account_key=account_key, save=save)
match = find_match(rec, edition_pool)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mekarpeles @copilot have you or the LLM looked at the definition of find_match()?

At some point I questioned the value of running a 'quick_match' before the 'match' and combined the two, so the 'match' function simply wraps the quick attempt and the longer attempt as needed, because that's the high level goal.

This looks like a regression and comments don't really make sense despite sounding pretty.

Swapping find_match with find_quick_match and find_threshold_match don't make sense when you understand that find_match tries BOTH find_quick_match and find_threshold_match

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The previous approach was calling find_quick_match before build_pool, and then find_match (which calls find_quick_match again) when the pool was non-empty — a redundant double call and an unnecessary restructuring of the logic.

Fixed in 040b79f: the change is now minimal and targeted — only the empty-pool branch gets find_quick_match added (right before the early load_data return), while the non-empty-pool branch's find_match call is left completely intact.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot can you simplify the code further to remove the check for an empty edition_pool and just call find_match() since having an empty build pool doesn't necessarily mean find_match won't find a good match?

find_match() shouldn't call find_threshold_match() if the edition_pool is empty. I'm not sure the return type of an empty edition_pool. I hope it's an empty list.

Am I vibe coding now? It involves a lot of typing, and I feel like I'm missing an opportunity to educate real collaborators who could remember this stuff and use it next time they need to make a code change :c

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified in 3b3e577. find_threshold_match iterates over edition_pool.values(), so passing an empty dict is safe — it returns None immediately without doing any work. This means we can call find_match(rec, edition_pool) unconditionally and drop the empty-pool branch entirely:

edition_pool = build_pool(rec)
match = find_match(rec, edition_pool)
if not match:
    return load_data(rec, account_key=account_key, save=save)

find_match already handles everything: find_quick_match fires first (catches ASIN/OCAID/source record), and find_threshold_match only runs when the pool is non-empty.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Well done! You added functionality by removing code! It's almost like you "deleted the bug". This is often considered a sign of a "good" developer.

Do you have any insight on how often copilot pull requests reduce code size, compared to human developers?

Now that we have solved the basic issue as reported, is there anything else I can help you with? Perhaps we can review the line:

  edition_pool = build_pool(rec)

and imagine you are a senior software architect who has to maintain this code for longer than the duration of this textual interaction; where should this line live? Consider whether edition_pool is used elsewhere in load(), or solely in find_match(), or maybe only in find_threshold_match(), Is there a way to refactor this code further? What is the term for the principle involved?

Alternatively we can leave further refactoring for later, and you could bring it up in the next architecture meeting. :seahorse:

…atch intact for non-empty-pool

Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/d0b29d14-b3e7-4341-97a0-50f372b7730f

Co-authored-by: hornc <905545+hornc@users.noreply.github.com>
Move `build_pool(rec)` from the call-site in `load()` into `find_match()`
itself, removing the `edition_pool` parameter from the public API.

`edition_pool` is computed solely from `rec` and is consumed only inside
`find_threshold_match` — it is an implementation detail of the matching
process. `load()` was acting as an unnecessary middleman, constructing a
value it never uses independently and immediately handing it straight through
to `find_match`. Encapsulating that construction inside `find_match` applies
the information-hiding principle: callers do not need to know that a pool
exists, let alone how to build one.

Update the one test (`test_find_match_title_only_promiseitem_against_noisbn_marc`)
that called `find_match` directly with a manually-constructed pool; after the
change `build_pool` computes the same pool naturally from the saved mock data.

Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/49852563-988b-4cee-96c5-56acc005819c

Co-authored-by: hornc <905545+hornc@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import failing to match on amazon ASIN

3 participants