fix(dianping/shop): correct in-browser name and reviews extraction#1312
Merged
fix(dianping/shop): correct in-browser name and reviews extraction#1312
Conversation
The merged adapter had two silent in-browser bugs that the mocked-evaluate unit tests don't catch — only live verify against www.dianping.com surfaces them: 1. Shop name returned `undefined`. The fallback parsed `document.title` with an ASCII-bracket split (`/[\\[\\]]/`) but dianping wraps the name in full-width brackets `【芈重山老火锅(五道口店)】...`. Switch to a `【...】` regex so the title fallback actually fires. 2. Reviews returned `5` instead of `21241`. The headText was whitespace- collapsed to `★★★★★4.821241条...`, fusing the rating and review digits; a head-wide `/\d+条/` then captured `4.821241` and rounded to `5`. Read the dedicated `.reviews / .review-num` element ("21241条") instead, with a `.review-title` "评价(<n>)" fallback.
This was referenced May 4, 2026
jackwener
added a commit
that referenced
this pull request
May 4, 2026
…ractors PR #1312 fixed two silent in-browser DOM bugs that the existing mocked `page.evaluate` tests could not catch: 1. shop title fallback split on ASCII `[]` while dianping renders full-width `【】`, so `name` was always empty (or `"undefined"`). 2. headText `\s+` collapse fused rating "4.8" with reviews "21241条", so a head-wide `/\d+条/` regex captured "4.821241" → 5. Both bugs only surfaced on live verify; mocked-evaluate unit tests fed pre-baked results to the func and the real DOM walk never ran. Make the in-browser extractor logic testable in CI: - clis/dianping/shop.js, clis/dianping/search.js: extract the IIFE bodies into top-level `extractShopFields()` / `extractSearchRows()` using bare `document` / `location`. The live adapters inject these via `page.evaluate(\`(\${fn.toString()})()\`)` so behavior is unchanged; both commands re-verified end-to-end against live dianping (shop returns name=芈重山老火锅(五道口店), reviews=21241, rating=4.8; search returns 3 result-shaped rows with correct ids). - clis/dianping/__fixtures__/shop.html (3.4KB), search.html (8.4KB): sanitized HTML snapshots — scripts/styles/iframes/comments stripped, img src placeholdered, only structural attributes kept. Trimmed to the minimum subtree needed to exercise the extractors (search keeps 3 of 15 li cards; shop keeps .shop-head + .desc-info + .review-title plus full-width 【】 title and headText with the rating/reviews fusion preserved). - clis/dianping/dianping.test.js: add a fifth describe block — "extractors against frozen HTML fixtures" — that loads the fixtures via JSDOM, swaps `globalThis.document` / `globalThis.location`, and asserts the post-fix behavior: * shop: name=芈重山老火锅(五道口店), reviewsRaw=21241条, rating=4.8, breakdown={口味:4.8,环境:4.8,服务:4.8,食材:4.9}, hours, rank, subway. * search: 3 rows with correct shop_ids, names, reviewsRaw, priceRaw, starClass; round-trip through parseReviewCount/parsePrice mappers to lock in {rating:5.0,reviews:21231,price:109} et al. * ok:false branches: shop fixture without `.shop-head`, search fixture with empty `#shop-all-list`. Manually verified the fixtures would catch the original bugs by running buggy extractor variants against shop.html — ASCII-bracket fallback returns `name="undefined"`, and head-wide `/\d+条/` returns `821241条` (both fail the new assertions). Test suite grows 14 → 18 passing tests; live verify of both commands still produces correct output post-refactor. Pattern intentionally limited to dianping as a reference point. If other sites with in-browser DOM extraction encounter similar silent bugs, this JSDOM-against-frozen-fixture pattern can be adopted per-site.
jackwener
added a commit
that referenced
this pull request
May 4, 2026
…ractors (#1313) PR #1312 fixed two silent in-browser DOM bugs that the existing mocked `page.evaluate` tests could not catch: 1. shop title fallback split on ASCII `[]` while dianping renders full-width `【】`, so `name` was always empty (or `"undefined"`). 2. headText `\s+` collapse fused rating "4.8" with reviews "21241条", so a head-wide `/\d+条/` regex captured "4.821241" → 5. Both bugs only surfaced on live verify; mocked-evaluate unit tests fed pre-baked results to the func and the real DOM walk never ran. Make the in-browser extractor logic testable in CI: - clis/dianping/shop.js, clis/dianping/search.js: extract the IIFE bodies into top-level `extractShopFields()` / `extractSearchRows()` using bare `document` / `location`. The live adapters inject these via `page.evaluate(\`(\${fn.toString()})()\`)` so behavior is unchanged; both commands re-verified end-to-end against live dianping (shop returns name=芈重山老火锅(五道口店), reviews=21241, rating=4.8; search returns 3 result-shaped rows with correct ids). - clis/dianping/__fixtures__/shop.html (3.4KB), search.html (8.4KB): sanitized HTML snapshots — scripts/styles/iframes/comments stripped, img src placeholdered, only structural attributes kept. Trimmed to the minimum subtree needed to exercise the extractors (search keeps 3 of 15 li cards; shop keeps .shop-head + .desc-info + .review-title plus full-width 【】 title and headText with the rating/reviews fusion preserved). - clis/dianping/dianping.test.js: add a fifth describe block — "extractors against frozen HTML fixtures" — that loads the fixtures via JSDOM, swaps `globalThis.document` / `globalThis.location`, and asserts the post-fix behavior: * shop: name=芈重山老火锅(五道口店), reviewsRaw=21241条, rating=4.8, breakdown={口味:4.8,环境:4.8,服务:4.8,食材:4.9}, hours, rank, subway. * search: 3 rows with correct shop_ids, names, reviewsRaw, priceRaw, starClass; round-trip through parseReviewCount/parsePrice mappers to lock in {rating:5.0,reviews:21231,price:109} et al. * ok:false branches: shop fixture without `.shop-head`, search fixture with empty `#shop-all-list`. Manually verified the fixtures would catch the original bugs by running buggy extractor variants against shop.html — ASCII-bracket fallback returns `name="undefined"`, and head-wide `/\d+条/` returns `821241条` (both fail the new assertions). Test suite grows 14 → 18 passing tests; live verify of both commands still produces correct output post-refactor. Pattern intentionally limited to dianping as a reference point. If other sites with in-browser DOM extraction encounter similar silent bugs, this JSDOM-against-frozen-fixture pattern can be adopted per-site.
5 tasks
jackwener
added a commit
that referenced
this pull request
May 4, 2026
…L fixtures (#1318) dianping/__fixtures__/search.html and shop.html came out of page.content() with hundreds of blank lines that JSDOM ignores during parsing — pure visual / disk noise that bloats reviewer diff and obscures the meaningful DOM subtree the fixture freezes. search.html: 372 → 168 lines (-204 lines, -572 bytes) shop.html: 39 → 6 lines (-33 lines, -64 bytes) `awk 'NF>0'` keeps every line that has any non-whitespace character, so the minified mega-line (where the rating-vs-reviews adjacency that triggers #1312 silent fusion lives) is preserved verbatim. dianping.test.js 18 tests still pass, and reintroducing the buggy `headText.match(/(\d+)条/)` extractor still makes the regression guard fail with the expected '821241条' (proving the fusion-bug detection power is intact after the strip). Per WAWQAQ feedback in #1313 thread; opus independently validated the same approach before the cleanup.
Merged
3 tasks
jackwener
added a commit
that referenced
this pull request
May 4, 2026
…ser DOM extractors (#1319) Codify the JSDOM-against-frozen-fixture pattern that PR #1313 introduced for dianping (and that PR #1318 had to follow up to clean up). The skill previously had no reference for this category of test, so authors of the next adapter that hits silent-in-browser-DOM bugs would either reinvent it or skip it. Key conventions captured: - **Mandatory awk 'NF>0' as the final step of fixture creation.** The blank-line noise that PR #1318 removed (84.6% / 54.8% of file content in dianping/{shop,search}.html) came from manually stripping script/style content without collapsing the surrounding newlines. Skipping this step is the silent quality regression that the next fixture author would also hit. - **Trim-to-minimum but never re-flow content.** Some bugs depend on text-node adjacency without intervening whitespace (dianping #1312 bug #2: rating "4.8" + reviews "21241条" fused as "4.821241条"). Pretty-printing the meaningful mega-line would mask the very condition the test is meant to catch. - **Reverse-validate the regression guard.** "18/18 tests pass" only proves agreement with the current implementation, not that the test would have caught the original bug. Reintroducing the buggy variant must make the test fail — otherwise the fixture is over-stripped or the assertion is too loose. - **__fixtures__/ is the documented exception** to the "no committed HTML dumps" rule in the skill's "关键约定". Calling that out explicitly because the rule otherwise reads as "all HTML in repo is bad," which the dianping fixture pattern intentionally violates for a real reason. Background: WAWQAQ in #1313 follow-up thread (`#OpenCLI:36d2f65a`) asked twice — first about the visible blank-line noise (→ PR #1318 cleanup), then about the root cause and what should improve in the workflow itself. This is the workflow improvement. No new tooling / CI gate / lint introduced (B 1-week gate freeze still applies). When a fifth fixture site adopts this pattern, `opencli browser fixture-snapshot` automation can be revisited; until then, runbook discipline + skill reference is the right scope.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #1309. Two silent in-browser bugs slipped past the mocked-evaluate unit tests — only live verify against www.dianping.com surfaces them. Confirmed fixed by live run against
H20FrTRI6kbXbgTu(芈重山老火锅·五道口店, 21241 reviews, ¥109).namewasundefined. Thedocument.titlefallback split on ASCII brackets/[\[\]]/, but dianping wraps the shop name in full-width【...】. Switched to a【([^】]+)】regex so the title fallback actually fires.reviewswas5instead of21241. The whitespace-collapsedheadTextbecomes★★★★★4.821241条…, fusing the rating with the review digits; a head-wide/\d+条/then captured4.821241and rounded to5. Switched to the dedicated.reviews / .review-numelement ("21241条"), with a.review-title评价(<n>)fallback.Why the unit tests didn't catch this
dianping.test.jsmockspage.evaluate()and feeds it pre-baked extraction results, so it can't exercise the real DOM walk. Adding a JSDOM-against-frozen-fixture test for shop pages would be a worthwhile follow-up; this PR only fixes the bug.Test plan
pnpm vitest run clis/dianping/dianping.test.js --project adapter→ 14/14 passnode dist/src/main.js dianping shop H20FrTRI6kbXbgTu→name=芈重山老火锅(五道口店),reviews=21241,rating=4.8, all auxiliary fields populated