Fix Amazon ASIN import failing to match existing editions#12360
Fix Amazon ASIN import failing to match existing editions#12360
Conversation
…ad() Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/27cf8652-e5d0-494b-b6e8-20c663d9366d Co-authored-by: mekarpeles <978325+mekarpeles@users.noreply.github.com>
Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/27cf8652-e5d0-494b-b6e8-20c663d9366d Co-authored-by: mekarpeles <978325+mekarpeles@users.noreply.github.com>
| if not edition_pool: | ||
| # No match candidates found, add edition | ||
| return load_data(rec, account_key=account_key, save=save) | ||
| match = find_match(rec, edition_pool) |
There was a problem hiding this comment.
@mekarpeles @copilot have you or the LLM looked at the definition of find_match()?
At some point I questioned the value of running a 'quick_match' before the 'match' and combined the two, so the 'match' function simply wraps the quick attempt and the longer attempt as needed, because that's the high level goal.
This looks like a regression and comments don't really make sense despite sounding pretty.
Swapping find_match with find_quick_match and find_threshold_match don't make sense when you understand that find_match tries BOTH find_quick_match and find_threshold_match
There was a problem hiding this comment.
Good catch. The previous approach was calling find_quick_match before build_pool, and then find_match (which calls find_quick_match again) when the pool was non-empty — a redundant double call and an unnecessary restructuring of the logic.
Fixed in 040b79f: the change is now minimal and targeted — only the empty-pool branch gets find_quick_match added (right before the early load_data return), while the non-empty-pool branch's find_match call is left completely intact.
There was a problem hiding this comment.
@copilot can you simplify the code further to remove the check for an empty edition_pool and just call find_match() since having an empty build pool doesn't necessarily mean find_match won't find a good match?
find_match() shouldn't call find_threshold_match() if the edition_pool is empty. I'm not sure the return type of an empty edition_pool. I hope it's an empty list.
Am I vibe coding now? It involves a lot of typing, and I feel like I'm missing an opportunity to educate real collaborators who could remember this stuff and use it next time they need to make a code change :c
There was a problem hiding this comment.
Simplified in 3b3e577. find_threshold_match iterates over edition_pool.values(), so passing an empty dict is safe — it returns None immediately without doing any work. This means we can call find_match(rec, edition_pool) unconditionally and drop the empty-pool branch entirely:
edition_pool = build_pool(rec)
match = find_match(rec, edition_pool)
if not match:
return load_data(rec, account_key=account_key, save=save)find_match already handles everything: find_quick_match fires first (catches ASIN/OCAID/source record), and find_threshold_match only runs when the pool is non-empty.
There was a problem hiding this comment.
@copilot Well done! You added functionality by removing code! It's almost like you "deleted the bug". This is often considered a sign of a "good" developer.
Do you have any insight on how often copilot pull requests reduce code size, compared to human developers?
Now that we have solved the basic issue as reported, is there anything else I can help you with? Perhaps we can review the line:
edition_pool = build_pool(rec)
and imagine you are a senior software architect who has to maintain this code for longer than the duration of this textual interaction; where should this line live? Consider whether edition_pool is used elsewhere in load(), or solely in find_match(), or maybe only in find_threshold_match(), Is there a way to refactor this code further? What is the term for the principle involved?
Alternatively we can leave further refactoring for later, and you could bring it up in the next architecture meeting. :seahorse:
…atch intact for non-empty-pool Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/d0b29d14-b3e7-4341-97a0-50f372b7730f Co-authored-by: hornc <905545+hornc@users.noreply.github.com>
…ranch Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/cf9fb966-bd9e-4586-8d9f-761639373198 Co-authored-by: hornc <905545+hornc@users.noreply.github.com>
Move `build_pool(rec)` from the call-site in `load()` into `find_match()` itself, removing the `edition_pool` parameter from the public API. `edition_pool` is computed solely from `rec` and is consumed only inside `find_threshold_match` — it is an implementation detail of the matching process. `load()` was acting as an unnecessary middleman, constructing a value it never uses independently and immediately handing it straight through to `find_match`. Encapsulating that construction inside `find_match` applies the information-hiding principle: callers do not need to know that a pool exists, let alone how to build one. Update the one test (`test_find_match_title_only_promiseitem_against_noisbn_marc`) that called `find_match` directly with a manually-constructed pool; after the change `build_pool` computes the same pool naturally from the saved mock data. Agent-Logs-Url: https://github.com/internetarchive/openlibrary/sessions/49852563-988b-4cee-96c5-56acc005819c Co-authored-by: hornc <905545+hornc@users.noreply.github.com>
test_load_matches_existing_edition_on_asin_when_pool_is_emptybuild_poolinsidefind_match— remove leakyedition_poolparameter (information hiding)test_find_match_title_only_promiseitem_against_noisbn_marcto callfind_match(rec)directly