Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix errors in the tokenizer and tree construction tests. #136

Merged
merged 23 commits into from Jul 5, 2021

Conversation

stevecheckoway
Copy link
Contributor

For several years, I've maintained a bunch of branches with fixes to the errors in the html5lib-tests's tokenizer and tree construction tests. There were corresponding PRs that had been open for years (#108, #109, #110, #111, #112, #113, #114, #115, #116, #117, #118, #119).

I don't think any of the html5lib-tests maintainers is currently interested in errors. That's completely fine, of course, and totally understandable since only the tokenizer errors have actually been standardized.

But in the mean time, having the correct number of errors in the tests is useful to me as I maintain the gumbo HTML parser that's used in Nokogiri. Maintaining individual branches for the specific PRs listed above plus a branch that contains all of the individual commits is too troublesome. To that end, I closed all the PRs above and I'm opening this one which contains all of the earlier fixes plus new fixes I added in the past few days.

I plan to sporadically keep this branch (and thus this PR) up to date.

`<!-- -- -->` is not an error. The middle `--` cause the tokenizer to
enter the comment end state. The "anything else" clause appends two `-`
to the comment token's data and the current input character is
reconsumed in the comment state.
Starting from the data state, we have
1. `<` switches to the tag open state
2. `!` switches to the markup declaration open state
3. `--` switches to the comment start state
4. `-` (the third one) switches to the comment start dash state
5. `-` (the fourth one) switches to the comment end state
6. `-` (the fifth one) appends `-` to the comment and does not change
   state
7. `>` emits the comment token and switches to the data state.
A DOCTYPE token is an error in one of three cases:
1. The token's name is not `html`;
2. The token's public identifier is not missing;
3. The token's system identifier is not missing and the token's system
   identifier isn't `about:legacy-compat`.

This appears to have changed at some point from a much more complex set
of conditions.
A `>` after `DOCTYPE` is a missing-doctype-name parse error but it is
not also a missing-whitespace-before-doctype-name parse error.

Doctypes with a public identifier is a (currently unnamed) parse error.
Named character references in attributes whose last character is not `;`
and for which the next input character is `=` (or ASCII alphanumeric,
but this isn't tested here), flushes the code points consumed as a
character reference _without_ adding a parse error.

Named character references not in attributes whose last character is not
`;` are errors, regardless of the following character as noted in the
`#new-errors` section but without an entry in `#errors`, the number of
errors are wrong. (See
html5lib#107).

Separately, this adds the missing expected-doctype-but-got-start-tag
error.
Here are the (abbreviated) steps (intermingling the tree-construction
and tokenizer).

1. `<script>` token causes `html` and `head` elements to be inserted and
   is processed in the "in head" insertion mode
2. `<script>` token switches the tokenizer to the script data state and
   switches to the "text" insertion mode
3. `<` switches to the script data less-than sign state
4. `!` switches to the script data escape start state and emits `<!`
5. EOF is reconsumed in the script data state
6. EOF emits an EOF token
7. EOF token (in the "text" insertion mode) is a parse error,
   `<script>` is popped off the stack of open elements, switches back to
   the "in head" insertion mode and reprocesses the token
8. EOF token (in "in head") triggers the "anything else clause" which
   pops the `head` element, switches to "after head", inserts a `body`
   token, switches to "in body", and reprocesses
9. EOF token (in "in body") stops parsing with no error because the
   stack of open elements contains `html` and `body`

Only step 7 adds a parse error.
Without it, the number of errors is incorrect.
```
A<table><tr> B</tr> </em>C</table>
```

The second space isn't foster parented. The `</tr>` triggers the
"anything else" clause of the in table text insertion mode which causes
` B` to be foster parented. The following space clears the pending table
character tokens list and thus the `</em>` inserts the space without
foster parenting.

This is also clear from the text node `A BC` in the result. A foster
parented second space would result in `A B C`.
```
<math><annotation-xml></svg>x
```

The `</svg>` is parsed in foreign context because
1. The stack of open elements is not empty;
2. The adjusted current node (`annotation-xml` element) is not in the
   HTML namespace;
3. The adjusted current node is not a MathML text integration point;
4. The adjusted current node is not a MathML text integration point
   (separate condition from 3);
5. The adjusted current node is a MathML `annotation-xml` element but
   the token (`</svg>`) is not a start tag;
6. The token is not a start tag (also the `annotation-xml` isn't an HTML
   integration point because it lacks a particular attribute);
7. The token isn't a character token (also the `annotation-xml` isn't an
   HTML integration point).

Thus the "any other end tag" clause of 12.2.6.5 "The rules for parsing
tokens in foreign context" applies.

The current node's tag name (`annotation-xml`) is not the same as the
tag name of the token (`svg`) so this is a parse error.

Since there's no `svg` element in the stack of open elements, the loop
in step 3 through 6 of the "any other end tag" clause exits when the
`body` element is reached. and the token is processed according to the
current insertion mode, "in body."

The `</svg>` matches the "any other end tag" of "in body." Since the
MathML `annotation-xml` element is not an HTML element but is special
which is a second parse error.
These are tricky. A `<math>` tag as the first token in a document
fragment whose context node is `td` (or `th`, but this isn't tested
here) is fine. One in a document fragment whose context node is `tr`,
`thead`, `tbody`, or `tfoot` is a parse error and the elements are
foster parented.

The table element start tags after the `<mo>` tag are parsed as html and
there are no `tr` (for `tr` elements) elements or `thead`, `tbody`, or
`tfoot` (for the others) elements in table scope which is a parse error.
I just invented some sames for those.

The `</table>` tag after the `<mo>` is parsed as foreign but it doesn't
match `<mo>` so it's a parse error.

Finally, the EOF occurs while a bunch of elements are open which is a
parse error.
The `</td>` is parsed in the "in cell" insertion mode and is an error
because it doesn't match the open `span` element. Each of the three
characters in `Foo` is reparented and is an error.
If the `#errors` section should have the same number of lines as errors
(see html5lib#107), then the
NULL-character errors need to be accounted for.
Also fix up the `#new-errors` section to make the line and column
numbers match the others.
These all appear to be duplicated errors (or more like an explanation of
the error).
Each character in the table (caused by the `<col>`) gets reparented and
causes an error.

The second `<a>` gets reparented but there's already one open, so that's
a second error but the open one is not in scope so that's a third error.
Fix the line and column numbers.

Currently, the number of errors in the `#errors` section of the test is
the correct number of errors. Counting the ones in `#new-errors` breaks
hundreds of tests because they contain an equivalent error in `#errors`.
So this adds the eof in comment error to `#errors` as well.
Adds the other spec-mandated errors. When an error is caused by a tag,
the line and column numbers are the line and column the tag starts in.
This error line appears to have been duplicated by mistake.
@Ms2ger
Copy link
Contributor

Ms2ger commented Jun 28, 2021

If @hsivonen (or someone else) doesn't object in the next two weeks, I'd just merge this.

Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rs=me. Thanks.

@jgraham jgraham merged commit 6030cb6 into html5lib:master Jul 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants