Skip to content

Fix .class and #id matching inside quoted attribute values#105

Merged
dereuromark merged 1 commit intophp-collective:masterfrom
danielledeleo:fix/attribute-parser-dot-in-quoted-values
Mar 23, 2026
Merged

Fix .class and #id matching inside quoted attribute values#105
dereuromark merged 1 commit intophp-collective:masterfrom
danielledeleo:fix/attribute-parser-dot-in-quoted-values

Conversation

@danielledeleo
Copy link
Copy Markdown
Contributor

Hey! I found this while building a documentation tool on top of djot-php.

When an attribute value contains a dot or hash — e.g. {include="note.dj"} — the .class regex matches .dj" inside the quoted string, producing a spurious class="dj" alongside the intended include="note.dj".

Repro:

[foo]{include="bar.dj"}

// Before fix: <p><span class="dj&quot;" include="bar.dj">foo</span></p>
// After fix:  <p><span include="bar.dj">foo</span></p>

Root cause: The .class and #id regexes run over the raw attribute string without skipping quoted values. The reference JS implementation doesn't have this issue because its state machine naturally ignores content inside quotes.

Fix: Strip quoted values before running the .class and #id regexes — the same pattern already used for boolean attribute matching in the same file. Minimal change, no architectural changes.

The regex-based .class and #id parsers in AttributeParser were matching
dots and hashes inside quoted attribute values, e.g. {include="note.dj"}
would produce a spurious class="dj" alongside include="note.dj".

The reference JS implementation (jgm/djot) avoids this because its
state machine naturally skips quoted content. The fix strips quoted
values before running the .class and #id regexes, consistent with
how boolean attributes are already handled in the same file.

Affected methods: parse() and applyToNode().
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.55%. Comparing base (fd9207c) to head (af64c55).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master     #105   +/-   ##
=========================================
  Coverage     93.55%   93.55%           
- Complexity     2315     2316    +1     
=========================================
  Files            79       79           
  Lines          6113     6117    +4     
=========================================
+ Hits           5719     5723    +4     
  Misses          394      394           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dereuromark dereuromark requested a review from Copilot March 23, 2026 05:02
@dereuromark dereuromark merged commit cb936d0 into php-collective:master Mar 23, 2026
8 checks passed
dereuromark added a commit that referenced this pull request Mar 23, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a parsing bug where .class / #id shorthand was incorrectly detected inside quoted attribute values (e.g. include="note.dj"), leading to spurious class/id attributes in rendered output.

Changes:

  • Mask quoted substrings before running .class / #id regex matching in AttributeParser::parse().
  • Refactor AttributeParser::applyToNode() to parse shorthand on a stripped string while parsing key=value on the original string.
  • Add PHPUnit coverage to ensure dots/hashes in quoted values don’t produce class/id shorthand matches, including converter-level integration checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/Parser/Utility/AttributeParser.php Strips quoted values prior to shorthand regex matching; refactors applyToNode() parsing flow.
tests/TestCase/Parser/AttributeParserTest.php Adds unit + integration tests for dots/hashes inside quoted attribute values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +35 to 47
// Strip quoted values before matching .class and #id to avoid
// matching dots/hashes inside attribute values like key="file.txt"
$strippedForShorthand = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/', '', $attrStr) ?? $attrStr;
$strippedForShorthand = preg_replace("/'(?:[^'\\\\]|\\\\.)*'/", '', $strippedForShorthand) ?? $strippedForShorthand;

// Parse .class
if (preg_match_all('/\.([^\s.#=}]+)/', $attrStr, $classMatches)) {
if (preg_match_all('/\.([^\s.#=}]+)/', $strippedForShorthand, $classMatches)) {
$attributes['class'] = implode(' ', $classMatches[1]);
}

// Parse #id
if (preg_match('/#([^\s.#=}]+)/', $attrStr, $idMatch)) {
if (preg_match('/#([^\s.#=}]+)/', $strippedForShorthand, $idMatch)) {
$attributes['id'] = $idMatch[1];
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$strippedForShorthand only removes quoted substrings, so .class / #id matching can still fire inside unquoted attribute values (e.g. include=note.dj would still match .dj and create a spurious class="dj"). Since parse() explicitly supports key=unquoted, consider stripping/skipping full key=value pairs (quoted and unquoted) before running the shorthand regexes (similar to the boolean-attr stripping below), or switch to a single tokenization pass that ignores values entirely when scanning for shorthand.

Copilot uses AI. Check for mistakes.
Comment on lines +115 to 146
// Strip quoted values before matching .class and #id to avoid
// matching dots/hashes inside attribute values like key="file.txt"
$strippedForShorthand = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/', '', $attrStr) ?? $attrStr;
$strippedForShorthand = preg_replace("/'(?:[^'\\\\]|\\\\.)*'/", '', $strippedForShorthand) ?? $strippedForShorthand;

// Parse .class and #id on stripped string
if (preg_match_all('/\.([^\s.#=}]+)/', $strippedForShorthand, $classMatches)) {
foreach ($classMatches[1] as $class) {
$node->addClass($class);
}
}

if (preg_match('/#([^\s.#=}]+)/', $strippedForShorthand, $idMatch)) {
$node->setAttribute('id', $idMatch[1]);
}

// Parse key=value on original string (needs quoted values intact)
$kvPattern = '/([^\s.#=}]+)="((?:[^"\\\\]|\\\\.)*)"|([^\s.#=}]+)=\'((?:[^\'\\\\]|\\\\.)*)\''
. '|([^\s.#=}]+)=([^\s}"\']+)/';
preg_match_all($pattern, $attrStr, $matches, PREG_SET_ORDER);
preg_match_all($kvPattern, $attrStr, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
if (!empty($match[1])) {
// Class attribute
$node->addClass($match[1]);
} elseif (!empty($match[2])) {
// ID attribute
$node->setAttribute('id', $match[2]);
} elseif (($match[3] ?? '') !== '') {
if (($match[1] ?? '') !== '') {
// key="double quoted value"
$node->setAttribute($match[1], self::processEscapes($match[2] ?? ''));
} elseif (($match[3] ?? '') !== '') {
// key='single quoted value'
$node->setAttribute($match[3], self::processEscapes($match[4] ?? ''));
} elseif (($match[5] ?? '') !== '') {
// key='single quoted value'
$node->setAttribute($match[5], self::processEscapes($match[6] ?? ''));
} elseif (($match[7] ?? '') !== '') {
// key=unquoted
$node->setAttribute($match[7], $match[8] ?? '');
$node->setAttribute($match[5], $match[6] ?? '');
}
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applyToNode() now parses .class/#id in a separate pass before key=value parsing. Because attribute rendering preserves insertion order (except for moving id first in HtmlRenderer), this can change the order of attributes relative to the source for inputs like {key=val .foo} (class would be inserted before key). To avoid output-order regressions, consider parsing all tokens in a single left-to-right pass (or using a regex strategy like (*SKIP)(*F) to ignore values while still emitting matches in source order) and applying them to the node in match order.

Copilot uses AI. Check for mistakes.
$result = $this->converter->convert('[link]{ref="/guides/setup.html"}');

$this->assertStringContainsString('ref="/guides/setup.html"', $result);
$this->assertStringNotContainsString('class', $result);
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion assertStringNotContainsString('class', $result) is overly broad and can false-fail if the output contains the substring class outside of a generated class attribute. It would be more precise to assert that class=" (or class=) is absent, so the test only fails when a class attribute is actually emitted.

Suggested change
$this->assertStringNotContainsString('class', $result);
$this->assertStringNotContainsString('class="', $result);

Copilot uses AI. Check for mistakes.
dereuromark added a commit that referenced this pull request Mar 23, 2026
PR #105 introduced a regression where .class and #id shorthand were
always parsed before key=value pairs, causing attributes to appear
in a different order than the source.

The reference JS implementation (jgm/djot) preserves source order:
    {key=val .foo} → key="val" class="foo"

Before this fix:
    {key=val .foo} → class="foo" key="val" (wrong)

After this fix:
    {key=val .foo} → key="val" class="foo" (correct)

Changes:
- Rewrote applyToNode() to use a single-pass regex that matches all
  token types in source order (quoted values, unquoted values, .class,
  #id, boolean attributes)
- Added skip pattern for invalid unquoted values (e.g. key=foo.bar)
  to prevent dots from being matched as .class shorthand
- Fixed test assertion: 'class' → 'class="' (per Copilot review)
- Added 6 new tests for attribute ordering

Note: HtmlRenderer intentionally moves 'id' to the front of attribute
lists - this is pre-existing behavior and not addressed here.
dereuromark added a commit that referenced this pull request Mar 23, 2026
PR #105 introduced a regression where .class and #id shorthand were
always parsed before key=value pairs, causing attributes to appear
in a different order than the source.

The reference JS implementation (jgm/djot) preserves source order:
    {key=val .foo} → key="val" class="foo"

Before this fix:
    {key=val .foo} → class="foo" key="val" (wrong)

After this fix:
    {key=val .foo} → key="val" class="foo" (correct)

Changes:
- Rewrote applyToNode() to use a single-pass regex that matches all
  token types in source order (quoted values, unquoted values, .class,
  #id, boolean attributes)
- Added skip pattern for invalid unquoted values (e.g. key=foo.bar)
  to prevent dots from being matched as .class shorthand
- Fixed test assertion: 'class' → 'class="' (per Copilot review)
- Added 6 new tests for attribute ordering

Note: HtmlRenderer intentionally moves 'id' to the front of attribute
lists - this is pre-existing behavior and not addressed here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants