Fix .class and #id matching inside quoted attribute values by danielledeleo · Pull Request #105 · php-collective/djot-php

danielledeleo · 2026-03-23T04:58:27Z

Hey! I found this while building a documentation tool on top of djot-php.

When an attribute value contains a dot or hash — e.g. {include="note.dj"} — the .class regex matches .dj" inside the quoted string, producing a spurious class="dj" alongside the intended include="note.dj".

Repro:

[foo]{include="bar.dj"}

// Before fix: <p><span class="dj&quot;" include="bar.dj">foo</span></p>
// After fix:  <p><span include="bar.dj">foo</span></p>

Root cause: The .class and #id regexes run over the raw attribute string without skipping quoted values. The reference JS implementation doesn't have this issue because its state machine naturally ignores content inside quotes.

Fix: Strip quoted values before running the .class and #id regexes — the same pattern already used for boolean attribute matching in the same file. Minimal change, no architectural changes.

The regex-based .class and #id parsers in AttributeParser were matching dots and hashes inside quoted attribute values, e.g. {include="note.dj"} would produce a spurious class="dj" alongside include="note.dj". The reference JS implementation (jgm/djot) avoids this because its state machine naturally skips quoted content. The fix strips quoted values before running the .class and #id regexes, consistent with how boolean attributes are already handled in the same file. Affected methods: parse() and applyToNode().

codecov · 2026-03-23T05:01:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.55%. Comparing base (fd9207c) to head (af64c55).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master     #105   +/-   ##
=========================================
  Coverage     93.55%   93.55%           
- Complexity     2315     2316    +1     
=========================================
  Files            79       79           
  Lines          6113     6117    +4     
=========================================
+ Hits           5719     5723    +4     
  Misses          394      394

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

)" This reverts commit cb936d0.

Copilot

Pull request overview

Fixes a parsing bug where .class / #id shorthand was incorrectly detected inside quoted attribute values (e.g. include="note.dj"), leading to spurious class/id attributes in rendered output.

Changes:

Mask quoted substrings before running .class / #id regex matching in AttributeParser::parse().
Refactor AttributeParser::applyToNode() to parse shorthand on a stripped string while parsing key=value on the original string.
Add PHPUnit coverage to ensure dots/hashes in quoted values don’t produce class/id shorthand matches, including converter-level integration checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`src/Parser/Utility/AttributeParser.php`	Strips quoted values prior to shorthand regex matching; refactors `applyToNode()` parsing flow.
`tests/TestCase/Parser/AttributeParserTest.php`	Adds unit + integration tests for dots/hashes inside quoted attribute values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-23T05:06:17Z

+        // Strip quoted values before matching .class and #id to avoid
+        // matching dots/hashes inside attribute values like key="file.txt"
+        $strippedForShorthand = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/', '', $attrStr) ?? $attrStr;
+        $strippedForShorthand = preg_replace("/'(?:[^'\\\\]|\\\\.)*'/", '', $strippedForShorthand) ?? $strippedForShorthand;
+
        // Parse .class
-        if (preg_match_all('/\.([^\s.#=}]+)/', $attrStr, $classMatches)) {
+        if (preg_match_all('/\.([^\s.#=}]+)/', $strippedForShorthand, $classMatches)) {
            $attributes['class'] = implode(' ', $classMatches[1]);
        }

        // Parse #id
-        if (preg_match('/#([^\s.#=}]+)/', $attrStr, $idMatch)) {
+        if (preg_match('/#([^\s.#=}]+)/', $strippedForShorthand, $idMatch)) {
            $attributes['id'] = $idMatch[1];


$strippedForShorthand only removes quoted substrings, so .class / #id matching can still fire inside unquoted attribute values (e.g. include=note.dj would still match .dj and create a spurious class="dj"). Since parse() explicitly supports key=unquoted, consider stripping/skipping full key=value pairs (quoted and unquoted) before running the shorthand regexes (similar to the boolean-attr stripping below), or switch to a single tokenization pass that ignores values entirely when scanning for shorthand.

Copilot · 2026-03-23T05:06:18Z

+        // Strip quoted values before matching .class and #id to avoid
+        // matching dots/hashes inside attribute values like key="file.txt"
+        $strippedForShorthand = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/', '', $attrStr) ?? $attrStr;
+        $strippedForShorthand = preg_replace("/'(?:[^'\\\\]|\\\\.)*'/", '', $strippedForShorthand) ?? $strippedForShorthand;
+
+        // Parse .class and #id on stripped string
+        if (preg_match_all('/\.([^\s.#=}]+)/', $strippedForShorthand, $classMatches)) {
+            foreach ($classMatches[1] as $class) {
+                $node->addClass($class);
+            }
+        }
+
+        if (preg_match('/#([^\s.#=}]+)/', $strippedForShorthand, $idMatch)) {
+            $node->setAttribute('id', $idMatch[1]);
+        }
+
+        // Parse key=value on original string (needs quoted values intact)
+        $kvPattern = '/([^\s.#=}]+)="((?:[^"\\\\]|\\\\.)*)"|([^\s.#=}]+)=\'((?:[^\'\\\\]|\\\\.)*)\''
            . '|([^\s.#=}]+)=([^\s}"\']+)/';
-        preg_match_all($pattern, $attrStr, $matches, PREG_SET_ORDER);
+        preg_match_all($kvPattern, $attrStr, $matches, PREG_SET_ORDER);

        foreach ($matches as $match) {
-            if (!empty($match[1])) {
-                // Class attribute
-                $node->addClass($match[1]);
-            } elseif (!empty($match[2])) {
-                // ID attribute
-                $node->setAttribute('id', $match[2]);
-            } elseif (($match[3] ?? '') !== '') {
+            if (($match[1] ?? '') !== '') {
                // key="double quoted value"
+                $node->setAttribute($match[1], self::processEscapes($match[2] ?? ''));
+            } elseif (($match[3] ?? '') !== '') {
+                // key='single quoted value'
                $node->setAttribute($match[3], self::processEscapes($match[4] ?? ''));
            } elseif (($match[5] ?? '') !== '') {
-                // key='single quoted value'
-                $node->setAttribute($match[5], self::processEscapes($match[6] ?? ''));
-            } elseif (($match[7] ?? '') !== '') {
                // key=unquoted
-                $node->setAttribute($match[7], $match[8] ?? '');
+                $node->setAttribute($match[5], $match[6] ?? '');
            }


applyToNode() now parses .class/#id in a separate pass before key=value parsing. Because attribute rendering preserves insertion order (except for moving id first in HtmlRenderer), this can change the order of attributes relative to the source for inputs like {key=val .foo} (class would be inserted before key). To avoid output-order regressions, consider parsing all tokens in a single left-to-right pass (or using a regex strategy like (*SKIP)(*F) to ignore values while still emitting matches in source order) and applying them to the node in match order.

Copilot · 2026-03-23T05:06:18Z

+        $result = $this->converter->convert('[link]{ref="/guides/setup.html"}');
+
+        $this->assertStringContainsString('ref="/guides/setup.html"', $result);
+        $this->assertStringNotContainsString('class', $result);


The assertion assertStringNotContainsString('class', $result) is overly broad and can false-fail if the output contains the substring class outside of a generated class attribute. It would be more precise to assert that class=" (or class=) is absent, so the test only fails when a class attribute is actually emitted.

Suggested change

$this->assertStringNotContainsString('class', $result);

$this->assertStringNotContainsString('class="', $result);

PR #105 introduced a regression where .class and #id shorthand were always parsed before key=value pairs, causing attributes to appear in a different order than the source. The reference JS implementation (jgm/djot) preserves source order: {key=val .foo} → key="val" class="foo" Before this fix: {key=val .foo} → class="foo" key="val" (wrong) After this fix: {key=val .foo} → key="val" class="foo" (correct) Changes: - Rewrote applyToNode() to use a single-pass regex that matches all token types in source order (quoted values, unquoted values, .class, #id, boolean attributes) - Added skip pattern for invalid unquoted values (e.g. key=foo.bar) to prevent dots from being matched as .class shorthand - Fixed test assertion: 'class' → 'class="' (per Copilot review) - Added 6 new tests for attribute ordering Note: HtmlRenderer intentionally moves 'id' to the front of attribute lists - this is pre-existing behavior and not addressed here.

dereuromark requested a review from Copilot March 23, 2026 05:02

Copilot started reviewing on behalf of dereuromark March 23, 2026 05:02 View session

dereuromark merged commit cb936d0 into php-collective:master Mar 23, 2026
8 checks passed

dereuromark added a commit that referenced this pull request Mar 23, 2026

Revert "Fix .class and #id matching inside quoted attribute values (#105

fb0b936

)" This reverts commit cb936d0.

Copilot AI reviewed Mar 23, 2026

View reviewed changes

dereuromark mentioned this pull request Mar 23, 2026

Validate unquoted attribute values per djot spec #106

Merged

dereuromark mentioned this pull request Mar 23, 2026

Preserve attribute source order in applyToNode() #107

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix .class and #id matching inside quoted attribute values#105

Fix .class and #id matching inside quoted attribute values#105
dereuromark merged 1 commit intophp-collective:masterfrom
danielledeleo:fix/attribute-parser-dot-in-quoted-values

danielledeleo commented Mar 23, 2026

Uh oh!

codecov Bot commented Mar 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 23, 2026

Uh oh!

Copilot AI Mar 23, 2026

Uh oh!

Copilot AI Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	$this->assertStringNotContainsString('class', $result);
	$this->assertStringNotContainsString('class="', $result);

Uh oh!

Conversation

danielledeleo commented Mar 23, 2026

Uh oh!

codecov Bot commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Mar 23, 2026 •

edited

Loading