Skip to content

fix(dianping/shop): correct in-browser name and reviews extraction#1312

Merged
jackwener merged 1 commit intomainfrom
fix/dianping-shop-name-reviews
May 4, 2026
Merged

fix(dianping/shop): correct in-browser name and reviews extraction#1312
jackwener merged 1 commit intomainfrom
fix/dianping-shop-name-reviews

Conversation

@jackwener
Copy link
Copy Markdown
Owner

Summary

Follow-up to #1309. Two silent in-browser bugs slipped past the mocked-evaluate unit tests — only live verify against www.dianping.com surfaces them. Confirmed fixed by live run against H20FrTRI6kbXbgTu (芈重山老火锅·五道口店, 21241 reviews, ¥109).

  • name was undefined. The document.title fallback split on ASCII brackets /[\[\]]/, but dianping wraps the shop name in full-width 【...】. Switched to a 【([^】]+)】 regex so the title fallback actually fires.
  • reviews was 5 instead of 21241. The whitespace-collapsed headText becomes ★★★★★4.821241条…, fusing the rating with the review digits; a head-wide /\d+条/ then captured 4.821241 and rounded to 5. Switched to the dedicated .reviews / .review-num element ("21241条"), with a .review-title 评价(<n>) fallback.

Why the unit tests didn't catch this

dianping.test.js mocks page.evaluate() and feeds it pre-baked extraction results, so it can't exercise the real DOM walk. Adding a JSDOM-against-frozen-fixture test for shop pages would be a worthwhile follow-up; this PR only fixes the bug.

Test plan

  • pnpm vitest run clis/dianping/dianping.test.js --project adapter → 14/14 pass
  • node dist/src/main.js dianping shop H20FrTRI6kbXbgTuname=芈重山老火锅(五道口店), reviews=21241, rating=4.8, all auxiliary fields populated
  • Reviewer can re-run the live shop command (requires a logged-in dianping cookie + already-cleared captcha)

The merged adapter had two silent in-browser bugs that the mocked-evaluate
unit tests don't catch — only live verify against www.dianping.com surfaces
them:

1. Shop name returned `undefined`. The fallback parsed `document.title` with
   an ASCII-bracket split (`/[\\[\\]]/`) but dianping wraps the name in
   full-width brackets `【芈重山老火锅(五道口店)】...`. Switch to a `【...】`
   regex so the title fallback actually fires.

2. Reviews returned `5` instead of `21241`. The headText was whitespace-
   collapsed to `★★★★★4.821241条...`, fusing the rating and review digits;
   a head-wide `/\d+条/` then captured `4.821241` and rounded to `5`. Read
   the dedicated `.reviews / .review-num` element ("21241条") instead, with
   a `.review-title` "评价(<n>)" fallback.
@jackwener jackwener merged commit c1bbf0b into main May 4, 2026
11 checks passed
jackwener added a commit that referenced this pull request May 4, 2026
…ractors

PR #1312 fixed two silent in-browser DOM bugs that the existing mocked
`page.evaluate` tests could not catch:

  1. shop title fallback split on ASCII `[]` while dianping renders
     full-width `【】`, so `name` was always empty (or `"undefined"`).
  2. headText `\s+` collapse fused rating "4.8" with reviews "21241条",
     so a head-wide `/\d+条/` regex captured "4.821241" → 5.

Both bugs only surfaced on live verify; mocked-evaluate unit tests fed
pre-baked results to the func and the real DOM walk never ran.

Make the in-browser extractor logic testable in CI:

  - clis/dianping/shop.js, clis/dianping/search.js: extract the IIFE
    bodies into top-level `extractShopFields()` / `extractSearchRows()`
    using bare `document` / `location`. The live adapters inject these
    via `page.evaluate(\`(\${fn.toString()})()\`)` so behavior is
    unchanged; both commands re-verified end-to-end against live
    dianping (shop returns name=芈重山老火锅(五道口店), reviews=21241,
    rating=4.8; search returns 3 result-shaped rows with correct ids).

  - clis/dianping/__fixtures__/shop.html (3.4KB), search.html (8.4KB):
    sanitized HTML snapshots — scripts/styles/iframes/comments stripped,
    img src placeholdered, only structural attributes kept. Trimmed to
    the minimum subtree needed to exercise the extractors (search keeps
    3 of 15 li cards; shop keeps .shop-head + .desc-info + .review-title
    plus full-width 【】 title and headText with the rating/reviews
    fusion preserved).

  - clis/dianping/dianping.test.js: add a fifth describe block —
    "extractors against frozen HTML fixtures" — that loads the fixtures
    via JSDOM, swaps `globalThis.document` / `globalThis.location`, and
    asserts the post-fix behavior:
      * shop: name=芈重山老火锅(五道口店), reviewsRaw=21241条, rating=4.8,
        breakdown={口味:4.8,环境:4.8,服务:4.8,食材:4.9}, hours, rank, subway.
      * search: 3 rows with correct shop_ids, names, reviewsRaw, priceRaw,
        starClass; round-trip through parseReviewCount/parsePrice mappers
        to lock in {rating:5.0,reviews:21231,price:109} et al.
      * ok:false branches: shop fixture without `.shop-head`, search
        fixture with empty `#shop-all-list`.

Manually verified the fixtures would catch the original bugs by running
buggy extractor variants against shop.html — ASCII-bracket fallback
returns `name="undefined"`, and head-wide `/\d+条/` returns `821241条`
(both fail the new assertions).

Test suite grows 14 → 18 passing tests; live verify of both commands
still produces correct output post-refactor.

Pattern intentionally limited to dianping as a reference point. If other
sites with in-browser DOM extraction encounter similar silent bugs, this
JSDOM-against-frozen-fixture pattern can be adopted per-site.
jackwener added a commit that referenced this pull request May 4, 2026
…ractors (#1313)

PR #1312 fixed two silent in-browser DOM bugs that the existing mocked
`page.evaluate` tests could not catch:

  1. shop title fallback split on ASCII `[]` while dianping renders
     full-width `【】`, so `name` was always empty (or `"undefined"`).
  2. headText `\s+` collapse fused rating "4.8" with reviews "21241条",
     so a head-wide `/\d+条/` regex captured "4.821241" → 5.

Both bugs only surfaced on live verify; mocked-evaluate unit tests fed
pre-baked results to the func and the real DOM walk never ran.

Make the in-browser extractor logic testable in CI:

  - clis/dianping/shop.js, clis/dianping/search.js: extract the IIFE
    bodies into top-level `extractShopFields()` / `extractSearchRows()`
    using bare `document` / `location`. The live adapters inject these
    via `page.evaluate(\`(\${fn.toString()})()\`)` so behavior is
    unchanged; both commands re-verified end-to-end against live
    dianping (shop returns name=芈重山老火锅(五道口店), reviews=21241,
    rating=4.8; search returns 3 result-shaped rows with correct ids).

  - clis/dianping/__fixtures__/shop.html (3.4KB), search.html (8.4KB):
    sanitized HTML snapshots — scripts/styles/iframes/comments stripped,
    img src placeholdered, only structural attributes kept. Trimmed to
    the minimum subtree needed to exercise the extractors (search keeps
    3 of 15 li cards; shop keeps .shop-head + .desc-info + .review-title
    plus full-width 【】 title and headText with the rating/reviews
    fusion preserved).

  - clis/dianping/dianping.test.js: add a fifth describe block —
    "extractors against frozen HTML fixtures" — that loads the fixtures
    via JSDOM, swaps `globalThis.document` / `globalThis.location`, and
    asserts the post-fix behavior:
      * shop: name=芈重山老火锅(五道口店), reviewsRaw=21241条, rating=4.8,
        breakdown={口味:4.8,环境:4.8,服务:4.8,食材:4.9}, hours, rank, subway.
      * search: 3 rows with correct shop_ids, names, reviewsRaw, priceRaw,
        starClass; round-trip through parseReviewCount/parsePrice mappers
        to lock in {rating:5.0,reviews:21231,price:109} et al.
      * ok:false branches: shop fixture without `.shop-head`, search
        fixture with empty `#shop-all-list`.

Manually verified the fixtures would catch the original bugs by running
buggy extractor variants against shop.html — ASCII-bracket fallback
returns `name="undefined"`, and head-wide `/\d+条/` returns `821241条`
(both fail the new assertions).

Test suite grows 14 → 18 passing tests; live verify of both commands
still produces correct output post-refactor.

Pattern intentionally limited to dianping as a reference point. If other
sites with in-browser DOM extraction encounter similar silent bugs, this
JSDOM-against-frozen-fixture pattern can be adopted per-site.
jackwener added a commit that referenced this pull request May 4, 2026
…L fixtures (#1318)

dianping/__fixtures__/search.html and shop.html came out of page.content() with
hundreds of blank lines that JSDOM ignores during parsing — pure visual / disk
noise that bloats reviewer diff and obscures the meaningful DOM subtree the
fixture freezes.

  search.html: 372 → 168 lines (-204 lines, -572 bytes)
  shop.html:    39 →   6 lines (-33 lines,  -64 bytes)

`awk 'NF>0'` keeps every line that has any non-whitespace character, so the
minified mega-line (where the rating-vs-reviews adjacency that triggers #1312
silent fusion lives) is preserved verbatim. dianping.test.js 18 tests still
pass, and reintroducing the buggy `headText.match(/(\d+)条/)` extractor still
makes the regression guard fail with the expected '821241条' (proving the
fusion-bug detection power is intact after the strip).

Per WAWQAQ feedback in #1313 thread; opus independently validated the same
approach before the cleanup.
jackwener added a commit that referenced this pull request May 4, 2026
…ser DOM extractors (#1319)

Codify the JSDOM-against-frozen-fixture pattern that PR #1313 introduced
for dianping (and that PR #1318 had to follow up to clean up). The skill
previously had no reference for this category of test, so authors of the
next adapter that hits silent-in-browser-DOM bugs would either reinvent
it or skip it.

Key conventions captured:

  - **Mandatory awk 'NF>0' as the final step of fixture creation.** The
    blank-line noise that PR #1318 removed (84.6% / 54.8% of file content
    in dianping/{shop,search}.html) came from manually stripping
    script/style content without collapsing the surrounding newlines.
    Skipping this step is the silent quality regression that the next
    fixture author would also hit.

  - **Trim-to-minimum but never re-flow content.** Some bugs depend on
    text-node adjacency without intervening whitespace
    (dianping #1312 bug #2: rating "4.8" + reviews "21241条" fused as
    "4.821241条"). Pretty-printing the meaningful mega-line would mask
    the very condition the test is meant to catch.

  - **Reverse-validate the regression guard.** "18/18 tests pass" only
    proves agreement with the current implementation, not that the test
    would have caught the original bug. Reintroducing the buggy variant
    must make the test fail — otherwise the fixture is over-stripped or
    the assertion is too loose.

  - **__fixtures__/ is the documented exception** to the "no committed
    HTML dumps" rule in the skill's "关键约定". Calling that out
    explicitly because the rule otherwise reads as "all HTML in repo is
    bad," which the dianping fixture pattern intentionally violates for
    a real reason.

Background: WAWQAQ in #1313 follow-up thread (`#OpenCLI:36d2f65a`) asked
twice — first about the visible blank-line noise (→ PR #1318 cleanup),
then about the root cause and what should improve in the workflow itself.
This is the workflow improvement.

No new tooling / CI gate / lint introduced (B 1-week gate freeze still
applies). When a fifth fixture site adopts this pattern,
`opencli browser fixture-snapshot` automation can be revisited; until
then, runbook discipline + skill reference is the right scope.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant