Fix .class and #id matching inside quoted attribute values#105
Conversation
The regex-based .class and #id parsers in AttributeParser were matching
dots and hashes inside quoted attribute values, e.g. {include="note.dj"}
would produce a spurious class="dj" alongside include="note.dj".
The reference JS implementation (jgm/djot) avoids this because its
state machine naturally skips quoted content. The fix strips quoted
values before running the .class and #id regexes, consistent with
how boolean attributes are already handled in the same file.
Affected methods: parse() and applyToNode().
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #105 +/- ##
=========================================
Coverage 93.55% 93.55%
- Complexity 2315 2316 +1
=========================================
Files 79 79
Lines 6113 6117 +4
=========================================
+ Hits 5719 5723 +4
Misses 394 394 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
)" This reverts commit cb936d0.
There was a problem hiding this comment.
Pull request overview
Fixes a parsing bug where .class / #id shorthand was incorrectly detected inside quoted attribute values (e.g. include="note.dj"), leading to spurious class/id attributes in rendered output.
Changes:
- Mask quoted substrings before running
.class/#idregex matching inAttributeParser::parse(). - Refactor
AttributeParser::applyToNode()to parse shorthand on a stripped string while parsingkey=valueon the original string. - Add PHPUnit coverage to ensure dots/hashes in quoted values don’t produce class/id shorthand matches, including converter-level integration checks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/Parser/Utility/AttributeParser.php |
Strips quoted values prior to shorthand regex matching; refactors applyToNode() parsing flow. |
tests/TestCase/Parser/AttributeParserTest.php |
Adds unit + integration tests for dots/hashes inside quoted attribute values. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Strip quoted values before matching .class and #id to avoid | ||
| // matching dots/hashes inside attribute values like key="file.txt" | ||
| $strippedForShorthand = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/', '', $attrStr) ?? $attrStr; | ||
| $strippedForShorthand = preg_replace("/'(?:[^'\\\\]|\\\\.)*'/", '', $strippedForShorthand) ?? $strippedForShorthand; | ||
|
|
||
| // Parse .class | ||
| if (preg_match_all('/\.([^\s.#=}]+)/', $attrStr, $classMatches)) { | ||
| if (preg_match_all('/\.([^\s.#=}]+)/', $strippedForShorthand, $classMatches)) { | ||
| $attributes['class'] = implode(' ', $classMatches[1]); | ||
| } | ||
|
|
||
| // Parse #id | ||
| if (preg_match('/#([^\s.#=}]+)/', $attrStr, $idMatch)) { | ||
| if (preg_match('/#([^\s.#=}]+)/', $strippedForShorthand, $idMatch)) { | ||
| $attributes['id'] = $idMatch[1]; |
There was a problem hiding this comment.
$strippedForShorthand only removes quoted substrings, so .class / #id matching can still fire inside unquoted attribute values (e.g. include=note.dj would still match .dj and create a spurious class="dj"). Since parse() explicitly supports key=unquoted, consider stripping/skipping full key=value pairs (quoted and unquoted) before running the shorthand regexes (similar to the boolean-attr stripping below), or switch to a single tokenization pass that ignores values entirely when scanning for shorthand.
| // Strip quoted values before matching .class and #id to avoid | ||
| // matching dots/hashes inside attribute values like key="file.txt" | ||
| $strippedForShorthand = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/', '', $attrStr) ?? $attrStr; | ||
| $strippedForShorthand = preg_replace("/'(?:[^'\\\\]|\\\\.)*'/", '', $strippedForShorthand) ?? $strippedForShorthand; | ||
|
|
||
| // Parse .class and #id on stripped string | ||
| if (preg_match_all('/\.([^\s.#=}]+)/', $strippedForShorthand, $classMatches)) { | ||
| foreach ($classMatches[1] as $class) { | ||
| $node->addClass($class); | ||
| } | ||
| } | ||
|
|
||
| if (preg_match('/#([^\s.#=}]+)/', $strippedForShorthand, $idMatch)) { | ||
| $node->setAttribute('id', $idMatch[1]); | ||
| } | ||
|
|
||
| // Parse key=value on original string (needs quoted values intact) | ||
| $kvPattern = '/([^\s.#=}]+)="((?:[^"\\\\]|\\\\.)*)"|([^\s.#=}]+)=\'((?:[^\'\\\\]|\\\\.)*)\'' | ||
| . '|([^\s.#=}]+)=([^\s}"\']+)/'; | ||
| preg_match_all($pattern, $attrStr, $matches, PREG_SET_ORDER); | ||
| preg_match_all($kvPattern, $attrStr, $matches, PREG_SET_ORDER); | ||
|
|
||
| foreach ($matches as $match) { | ||
| if (!empty($match[1])) { | ||
| // Class attribute | ||
| $node->addClass($match[1]); | ||
| } elseif (!empty($match[2])) { | ||
| // ID attribute | ||
| $node->setAttribute('id', $match[2]); | ||
| } elseif (($match[3] ?? '') !== '') { | ||
| if (($match[1] ?? '') !== '') { | ||
| // key="double quoted value" | ||
| $node->setAttribute($match[1], self::processEscapes($match[2] ?? '')); | ||
| } elseif (($match[3] ?? '') !== '') { | ||
| // key='single quoted value' | ||
| $node->setAttribute($match[3], self::processEscapes($match[4] ?? '')); | ||
| } elseif (($match[5] ?? '') !== '') { | ||
| // key='single quoted value' | ||
| $node->setAttribute($match[5], self::processEscapes($match[6] ?? '')); | ||
| } elseif (($match[7] ?? '') !== '') { | ||
| // key=unquoted | ||
| $node->setAttribute($match[7], $match[8] ?? ''); | ||
| $node->setAttribute($match[5], $match[6] ?? ''); | ||
| } |
There was a problem hiding this comment.
applyToNode() now parses .class/#id in a separate pass before key=value parsing. Because attribute rendering preserves insertion order (except for moving id first in HtmlRenderer), this can change the order of attributes relative to the source for inputs like {key=val .foo} (class would be inserted before key). To avoid output-order regressions, consider parsing all tokens in a single left-to-right pass (or using a regex strategy like (*SKIP)(*F) to ignore values while still emitting matches in source order) and applying them to the node in match order.
| $result = $this->converter->convert('[link]{ref="/guides/setup.html"}'); | ||
|
|
||
| $this->assertStringContainsString('ref="/guides/setup.html"', $result); | ||
| $this->assertStringNotContainsString('class', $result); |
There was a problem hiding this comment.
The assertion assertStringNotContainsString('class', $result) is overly broad and can false-fail if the output contains the substring class outside of a generated class attribute. It would be more precise to assert that class=" (or class=) is absent, so the test only fails when a class attribute is actually emitted.
| $this->assertStringNotContainsString('class', $result); | |
| $this->assertStringNotContainsString('class="', $result); |
PR #105 introduced a regression where .class and #id shorthand were always parsed before key=value pairs, causing attributes to appear in a different order than the source. The reference JS implementation (jgm/djot) preserves source order: {key=val .foo} → key="val" class="foo" Before this fix: {key=val .foo} → class="foo" key="val" (wrong) After this fix: {key=val .foo} → key="val" class="foo" (correct) Changes: - Rewrote applyToNode() to use a single-pass regex that matches all token types in source order (quoted values, unquoted values, .class, #id, boolean attributes) - Added skip pattern for invalid unquoted values (e.g. key=foo.bar) to prevent dots from being matched as .class shorthand - Fixed test assertion: 'class' → 'class="' (per Copilot review) - Added 6 new tests for attribute ordering Note: HtmlRenderer intentionally moves 'id' to the front of attribute lists - this is pre-existing behavior and not addressed here.
PR #105 introduced a regression where .class and #id shorthand were always parsed before key=value pairs, causing attributes to appear in a different order than the source. The reference JS implementation (jgm/djot) preserves source order: {key=val .foo} → key="val" class="foo" Before this fix: {key=val .foo} → class="foo" key="val" (wrong) After this fix: {key=val .foo} → key="val" class="foo" (correct) Changes: - Rewrote applyToNode() to use a single-pass regex that matches all token types in source order (quoted values, unquoted values, .class, #id, boolean attributes) - Added skip pattern for invalid unquoted values (e.g. key=foo.bar) to prevent dots from being matched as .class shorthand - Fixed test assertion: 'class' → 'class="' (per Copilot review) - Added 6 new tests for attribute ordering Note: HtmlRenderer intentionally moves 'id' to the front of attribute lists - this is pre-existing behavior and not addressed here.
Hey! I found this while building a documentation tool on top of djot-php.
When an attribute value contains a dot or hash — e.g. {include="note.dj"} — the .class regex matches .dj" inside the quoted string, producing a spurious class="dj" alongside the intended include="note.dj".
Repro:
[foo]{include="bar.dj"} // Before fix: <p><span class="dj"" include="bar.dj">foo</span></p> // After fix: <p><span include="bar.dj">foo</span></p>Root cause: The .class and #id regexes run over the raw attribute string without skipping quoted values. The reference JS implementation doesn't have this issue because its state machine naturally ignores content inside quotes.
Fix: Strip quoted values before running the .class and #id regexes — the same pattern already used for boolean attribute matching in the same file. Minimal change, no architectural changes.