Question about the logic of merged "Character" tokens after PR #92 in tokenizer tests #96

@syjer

Hi, I've noticed that after PR #92, the "Character" tokens in 2 test cases are merged even though tokenizer errors occur between them.

Even though the "ParseError" is no longer present in the output, the existing tests still have the following form (taken from test1.test):

{"description":"Entity without trailing semicolon (1)",
"input":"I'm &notit",
"output":[["Character","I'm "], ["Character", "\u00ACit"]],
"errors": [
    {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 9 }
]}

Previously, a "ParseError" would be present between the 2 "Character" tokens, so they would not be merged under the rule that the tokenizer README specifies:

 All adjacent character tokens are coalesced into a single ["Character", data] token. 
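To make the difference concrete, here is a minimal Python sketch of that coalescing rule. The `coalesce` helper and the token lists are hypothetical illustrations, not code from the test suite; the sketch assumes the old output format placed a bare "ParseError" string between tokens, as described above.

```python
def coalesce(tokens):
    """Merge adjacent ["Character", data] tokens; any other entry breaks the run."""
    out = []
    for tok in tokens:
        is_char = isinstance(tok, list) and tok[0] == "Character"
        prev_is_char = bool(out) and isinstance(out[-1], list) and out[-1][0] == "Character"
        if is_char and prev_is_char:
            # Extend the previous character run instead of starting a new token.
            out[-1] = ["Character", out[-1][1] + tok[1]]
        else:
            out.append(tok)
    return out

# Old format (assumed): the "ParseError" marker separates the two character runs.
old_style = [["Character", "I'm "], "ParseError", ["Character", "\u00ACit"]]
# New format: errors live in a separate "errors" list, so nothing separates the runs.
new_style = [["Character", "I'm "], ["Character", "\u00ACit"]]

print(coalesce(old_style))
print(coalesce(new_style))
```

With the marker present the two tokens stay separate; once it is gone, a literal application of the README rule merges them into one, which is the ambiguity this issue is about.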

All the tests follow the old logic except the following 2 (new?) tests in domjs.test:

{
    "description":"NUL in script HTML comment",
    "doubleEscaped":true,
    "initialStates":["Script data state"],
    "input":"<!--test\\u0000--><!--test-\\u0000--><!--test--\\u0000-->",
    "output":[["Character", "<!--test\\uFFFD--><!--test-\\uFFFD--><!--test--\\uFFFD-->"]],
    "errors":[
        { "code": "unexpected-null-character", "line": 1, "col": 9 },
        { "code": "unexpected-null-character", "line": 1, "col": 22 },
        { "code": "unexpected-null-character", "line": 1, "col": 36 }
    ]
}

I would expect:

["Character", "<!--test"], ["Character", "\\uFFFD--><!--test-"], ["Character", "\\uFFFD--><!--test--"], ["Character", "\\uFFFD-->"]

and

{
    "description":"NUL in script HTML comment - double escaped",
    "doubleEscaped":true,
    "initialStates":["Script data state"],
    "input":"<!--<script>\\u0000--><!--<script>-\\u0000--><!--<script>--\\u0000-->",
    "output":[["Character", "<!--<script>\\uFFFD--><!--<script>-\\uFFFD--><!--<script>--\\uFFFD-->"]],
    "errors":[
        { "code": "unexpected-null-character", "line": 1, "col": 13 },
        { "code": "unexpected-null-character", "line": 1, "col": 30 },
        { "code": "unexpected-null-character", "line": 1, "col": 48 }
    ]
}

I would expect:

["Character", "<!--<script>"], ["Character", "\\uFFFD--><!--<script>-"], ["Character", "\\uFFFD--><!--<script>--"], ["Character", "\\uFFFD-->"]
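As a rough illustration of how the expected outputs above follow from the old logic, the sketch below splits the merged "Character" data at each reported error column. The `split_at_error_columns` helper is hypothetical. It assumes single-line input, 1-based columns, and a one-to-one NUL to U+FFFD replacement (so input and output columns line up); it also uses the unescaped U+FFFD character rather than the double-escaped form in the test file.

```python
def split_at_error_columns(data, cols):
    """Break a character run before each 1-based error column."""
    pieces = []
    start = 0
    for col in sorted(cols):
        idx = col - 1  # convert the 1-based column to a 0-based string index
        if idx > start:
            pieces.append(["Character", data[start:idx]])
            start = idx
    if start < len(data):
        pieces.append(["Character", data[start:]])
    return pieces

# Merged output and error columns from the first of the two tests above.
merged = "<!--test\uFFFD--><!--test-\uFFFD--><!--test--\uFFFD-->"
pieces = split_at_error_columns(merged, [9, 22, 36])
for piece in pieces:
    print(piece)
```

Under these assumptions the four pieces match the expected tokens listed for the first test, and the same call with columns 13, 30 and 48 reproduces the expectation for the second.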

So, my question is:

Are the 2 new tests considered correct, meaning that all the other tests should have their "Character" tokens merged as described in the README? Or is there something that I'm missing?
