New and old parse errors #107

stevecheckoway · 2018-10-01T17:24:57Z

I'm trying to test that an HTML parser produces the correct errors, but some of the tests list both #errors and #new-errors, where the new errors seem to be the new standard names for old errors. For example,

#data
<?COM--MENT?>
#errors
(1,1): expected-tag-name-but-got-question-mark
(1,13): expected-doctype-but-got-eof
#new-errors
(1:2) unexpected-question-mark-instead-of-tag-name
#document
| <!-- ?COM--MENT? -->
| <html>
|   <head>
|   <body>

I'm not sure why the first error is there, but I assume an older version of the standard had something like a look ahead on the <. As I read the standard, there should be an unexpected-question-mark-instead-of-tag-name and a (currently unnamed in the standard) parse error for the missing DOCTYPE.

Is it always the case that a new error replaces exactly one of the old errors? If not, how should these be handled?

gsnedders · 2018-10-01T17:28:29Z

Looking at #92 again, which introduced these:

Note that for now we've decided to move new error codes to a separate section in tree construction stage tests to not mix things up. Once we have a spec for tree construction stage errors, we'll remove old errors and move new errors to #errors section.

So I think in principle #new-errors should be complete for tokenizer errors, and the number of lines in #errors should be the total number of parse errors (tokenizer + tree construction)?
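
For anyone wiring this into a test harness, here is a minimal Python sketch of that reading of the two sections. The helper names (`split_sections`, `new_error_code`, `check_errors`) are made up for illustration and are not part of html5lib-tests, and the section splitting is simplified: it only recognises a fixed set of header lines. It assumes the parser under test reports tokenizer errors by their spec code and unnamed tree-construction errors as `None`.

```python
# Section headers used by tree-construction tests; anything else is content.
HEADERS = {"#data", "#errors", "#new-errors", "#document",
           "#document-fragment", "#script-off", "#script-on"}

SAMPLE = """\
#data
<?COM--MENT?>
#errors
(1,1): expected-tag-name-but-got-question-mark
(1,13): expected-doctype-but-got-eof
#new-errors
(1:2) unexpected-question-mark-instead-of-tag-name
#document
| <!-- ?COM--MENT? -->
| <html>
|   <head>
|   <body>
"""


def split_sections(test_text):
    """Group the lines of one test under the section header above them."""
    sections = {}
    current = None
    for line in test_text.splitlines():
        if line in HEADERS:
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def new_error_code(line):
    """'(1:2) some-error-code' -> 'some-error-code'."""
    return line.split(") ", 1)[1]


def check_errors(sections, actual):
    """Compare parser output against one test.

    actual: list of reported errors, a spec code (str) for each tokenizer
    error and None for each unnamed tree-construction error (an assumption
    about how the parser under test reports errors).
    """
    # Every non-empty line in #errors counts as one parse error of any kind.
    expected_total = sum(1 for l in sections.get("#errors", []) if l.strip())
    # #new-errors is taken to be the complete list of tokenizer errors.
    expected_tokenizer = sorted(
        new_error_code(l) for l in sections.get("#new-errors", []) if l.strip()
    )
    actual_tokenizer = sorted(code for code in actual if code is not None)
    return len(actual) == expected_total and actual_tokenizer == expected_tokenizer
```

For the `<?COM--MENT?>` test quoted above, a parser that reports one `unexpected-question-mark-instead-of-tag-name` error plus one unnamed missing-DOCTYPE error would pass this check, since `#errors` has two lines and `#new-errors` has one.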

stevecheckoway · 2018-10-01T17:44:49Z

@gsnedders great, thanks! That eliminates about 250 test failures for me, 120 to go!

For a named character reference in an attribute whose last character is not `;` and for which the next input character is `=` (or an ASCII alphanumeric, but this isn't tested here), the tokenizer flushes the code points consumed as a character reference _without_ adding a parse error. Named character references outside attributes whose last character is not `;` are errors regardless of the following character, as noted in the `#new-errors` section; but without a corresponding entry in `#errors`, the number of errors is wrong. (See html5lib#107.) Separately, this adds the missing expected-doctype-but-got-start-tag error.
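
The rule described above can be sketched as a small Python predicate. This is only an illustration of the decision as worded here, not code from any parser; the function name and its parameters are hypothetical.

```python
import string

ASCII_ALPHANUMERIC = set(string.ascii_letters + string.digits)


def missing_semicolon_is_error(in_attribute, matched, next_char):
    """Should a matched named character reference add a parse error?

    in_attribute: True if the reference appears inside an attribute value.
    matched:      the text that matched a named reference, e.g. "&amp" or "&amp;".
    next_char:    the next input character, or "" at end of input.
    """
    if matched.endswith(";"):
        # Properly terminated reference: never a missing-semicolon error.
        return False
    if in_attribute and (next_char == "=" or next_char in ASCII_ALPHANUMERIC):
        # Historical exception: the consumed code points are flushed as
        # literal text and no parse error is added.
        return False
    # Everywhere else a missing ';' is an error, whatever follows.
    return True
```

So `&amp=` inside an attribute value adds no error, while the same input in text content does.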

stevecheckoway · 2018-10-02T02:14:04Z

@gsnedders Is #113 the right approach here? Two of the errors aren't errors (any longer?) and it looks like one is now but wasn't before.

It's also possible that the error logic didn't actually flip as that PR would indicate and instead the test was just wrong before.

gsnedders · 2018-10-02T15:21:57Z

@stevecheckoway FWIW the error data is by far the bit of data in the testsuite most likely to be wrong, because very few implementations use it, so it's entirely plausible that the test was just wrong.

stevecheckoway · 2018-10-02T15:34:45Z

@gsnedders Makes sense. Without standardized error names (at least until now), it's pretty tricky to test that the errors are correct. Number of errors is a pretty weak proxy so I can see why people wouldn't bother.

For what it's worth, those eight PRs resolve about half of the 120 test failures from different numbers of errors I'm getting when running against the #script-off tests with Nokogumbo. (Which isn't to say the remaining errors are entirely in the testsuite; I've caught bugs in my code, which, of course, is the whole purpose of my testing.)

If the `#errors` section should have the same number of lines as errors (see html5lib#107), then the NULL-character errors need to be accounted for.

stevecheckoway · 2021-03-12T16:49:16Z

Any updates on updating the errors in the test to reflect the spec?

The tree-construction stage still doesn't have its own error codes, unfortunately, but it'd be nice for the tests to at least have the correct number of errors.

stevecheckoway closed this as completed Oct 1, 2018

stevecheckoway mentioned this issue Oct 1, 2018

Comment errors #108

Closed

stevecheckoway mentioned this issue Oct 2, 2018

Fix entity errors #113

Closed

stevecheckoway reopened this Oct 2, 2018

stevecheckoway added a commit to stevecheckoway/html5lib-tests that referenced this issue Oct 2, 2018

Add new errors to errors

8890fc8

If the `#errors` section should have the same number of lines as errors (see html5lib#107), then the NULL-character errors need to be accounted for.

stevecheckoway mentioned this issue Oct 2, 2018

Add new errors to errors #118

Closed

stevecheckoway added a commit to stevecheckoway/html5lib-tests that referenced this issue Oct 3, 2018

Add new errors to errors

5c33b00

If the `#errors` section should have the same number of lines as errors (see html5lib#107), then the NULL-character errors need to be accounted for.

stevecheckoway mentioned this issue Mar 12, 2021

RFC: How to prepare for a future merge of nokogumbo into the nokogiri gem rubys/nokogumbo#170

Closed

stevecheckoway added a commit to stevecheckoway/html5lib-tests that referenced this issue Jun 26, 2021

Add new errors to errors

5e88d5c

If the `#errors` section should have the same number of lines as errors (see html5lib#107), then the NULL-character errors need to be accounted for.

jgraham pushed a commit that referenced this issue Jul 5, 2021

Add new errors to errors

8d9655a

If the `#errors` section should have the same number of lines as errors (see #107), then the NULL-character errors need to be accounted for.
