New and old parse errors #107

Open
stevecheckoway opened this issue Oct 1, 2018 · 6 comments

@stevecheckoway
Contributor

I'm trying to test that an HTML parser produces the correct errors, but some of the tests list both #errors and #new-errors, where the new errors seem to be the new standard names for old errors. For example,

```
#data
<?COM--MENT?>
#errors
(1,1): expected-tag-name-but-got-question-mark
(1,13): expected-doctype-but-got-eof
#new-errors
(1:2) unexpected-question-mark-instead-of-tag-name
#document
| <!-- ?COM--MENT? -->
| <html>
|   <head>
|   <body>
```

I'm not sure why the first error is there, but I assume an older version of the standard had something like a lookahead on the <. As I read the standard, there should be an unexpected-question-mark-instead-of-tag-name error and a (currently unnamed in the standard) parse error for the missing DOCTYPE.

Is it always the case that a new error replaces exactly one of the old errors? If not, how should these be handled?

@gsnedders
Member

Looking at #92 again, which introduced these:

> Note that for now we've decided to move new error codes to a separate section in tree construction stage tests to not mix things up. Once we have a spec for tree construction stage errors, we'll remove old errors and move new errors to #errors section.

So I think in principle #new-errors should be complete for tokenizer errors, and the number of lines in #errors should be the total number of parse errors (tokenizer + tree construction)?
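To make that reading concrete, here is a minimal sketch in Python that splits one test case into its sections and derives the counts under this interpretation. The function names `parse_test` and `expected_counts` are made up for illustration, and the parser is deliberately naive: it would misread a `#data` line that happens to equal a section header.

```python
# Section headers used by tree-construction test files.
HEADERS = {"#data", "#errors", "#new-errors", "#document-fragment",
           "#script-off", "#script-on", "#document"}

def parse_test(text):
    """Split one test case into {section name: [lines]} (naive sketch)."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line in HEADERS:
            current = line[1:]          # drop the leading '#'
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

def expected_counts(sections):
    """Return (total, tokenizer, tree_construction) expected error counts.

    Assumes the interpretation above: #errors has one line per parse
    error of any kind, and #new-errors lists every tokenizer error.
    """
    total = len([l for l in sections.get("errors", []) if l.strip()])
    tokenizer = len([l for l in sections.get("new-errors", []) if l.strip()])
    return total, tokenizer, total - tokenizer

SAMPLE = """#data
<?COM--MENT?>
#errors
(1,1): expected-tag-name-but-got-question-mark
(1,13): expected-doctype-but-got-eof
#new-errors
(1:2) unexpected-question-mark-instead-of-tag-name
#document
| <!-- ?COM--MENT? -->
| <html>
|   <head>
|   <body>
"""

counts = expected_counts(parse_test(SAMPLE))
print(counts)
```

For the `<?COM--MENT?>` test this gives two errors in total: one tokenizer error and one tree-construction error (the missing DOCTYPE).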

@stevecheckoway
Contributor Author

@gsnedders great, thanks! That eliminates about 250 test failures for me, 120 to go!

stevecheckoway added a commit to stevecheckoway/html5lib-tests that referenced this issue Oct 2, 2018
Named character references in attributes whose last character is not `;`
and for which the next input character is `=` (or ASCII alphanumeric,
but this isn't tested here), flushes the code points consumed as a
character reference _without_ adding a parse error.

Named character references not in attributes whose last character is not
`;` are errors, regardless of the following character as noted in the
`#new-errors` section but without an entry in `#errors`, the number of
errors are wrong. (See
html5lib#107).

Separately, this adds the missing expected-doctype-but-got-start-tag
error.
@stevecheckoway
Contributor Author

@gsnedders Is #113 the right approach here? Two of the errors aren't errors (any longer?) and it looks like one is now but wasn't before.

It's also possible that the error logic didn't actually flip as that PR would indicate and instead the test was just wrong before.

stevecheckoway reopened this Oct 2, 2018
@gsnedders
Member

@stevecheckoway FWIW the error data is by far the bit of data in the testsuite most likely to be wrong, because very few implementations use it, so it's entirely plausible that the test was just wrong.

@stevecheckoway
Contributor Author

@gsnedders Makes sense. Without standardized error names (at least until now), it's pretty tricky to test that the errors are correct. Number of errors is a pretty weak proxy so I can see why people wouldn't bother.
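As a rough illustration of the difference between the two checks, the sketch below compares only the count against `#errors` but compares names against `#new-errors`. Everything here is an assumption for illustration: `check_errors` is a hypothetical helper, `reported_codes` stands for whatever list of error codes the parser under test produces, and `missing-doctype` is a placeholder name, since the standard did not name that error.

```python
import re
from collections import Counter

# Matches "(1:2) some-code" as well as the older "(1,1): some-code" style.
ERROR_LINE = re.compile(r"^\((\d+)[:,](\d+)\):?\s*(\S+)$")

def error_code(line):
    """Extract the error code from one error line, or None if it does not match."""
    m = ERROR_LINE.match(line.strip())
    return m.group(3) if m else None

def check_errors(errors_lines, new_errors_lines, reported_codes):
    """Return a list of human-readable problems (empty when all checks pass)."""
    problems = []
    # Weak check: only the number of errors is compared against #errors.
    if len(reported_codes) != len(errors_lines):
        problems.append(
            f"expected {len(errors_lines)} parse errors, got {len(reported_codes)}")
    # Stronger check: every named tokenizer error must have been reported.
    named = [c for c in map(error_code, new_errors_lines) if c is not None]
    missing = Counter(named) - Counter(reported_codes)
    for code, n in sorted(missing.items()):
        problems.append(f"missing tokenizer error {code} (x{n})")
    return problems

errors = ["(1,1): expected-tag-name-but-got-question-mark",
          "(1,13): expected-doctype-but-got-eof"]
new_errors = ["(1:2) unexpected-question-mark-instead-of-tag-name"]

# 'missing-doctype' is a made-up name for the unnamed tree-construction error.
ok = check_errors(errors, new_errors,
                  ["unexpected-question-mark-instead-of-tag-name", "missing-doctype"])
bad = check_errors(errors, new_errors, ["missing-doctype"])
print(ok, bad)
```

Positions are parsed but ignored here; a stricter harness could compare them too once an implementation reports line and column in the same convention as the tests.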

For what it's worth, those eight PRs resolve about half of the 120 test failures from different numbers of errors I'm getting when running against the #script-off tests with Nokogumbo. (Which isn't to say the remaining errors are entirely in the testsuite. I've caught bugs in my code, which, of course, is the whole purpose of my testing.)

stevecheckoway added a commit to stevecheckoway/html5lib-tests that referenced this issue Oct 2, 2018
If the `#errors` section should have the same number of lines as errors
(see html5lib#107), then the
NULL-character errors need to be accounted for.
stevecheckoway added 2 commits to stevecheckoway/html5lib-tests that referenced this issue Oct 3, 2018
@stevecheckoway
Contributor Author

Any progress on updating the errors in the tests to reflect the spec?

The tree-construction stage still doesn't have its own error codes, unfortunately, but it'd be nice for the tests to at least have the correct number of errors.

stevecheckoway added 2 commits to stevecheckoway/html5lib-tests that referenced this issue Jun 26, 2021
jgraham pushed a commit that referenced this issue Jul 5, 2021
Named character references in attributes whose last character is not `;`
and for which the next input character is `=` (or ASCII alphanumeric,
but this isn't tested here), flushes the code points consumed as a
character reference _without_ adding a parse error.

Named character references not in attributes whose last character is not
`;` are errors, regardless of the following character as noted in the
`#new-errors` section but without an entry in `#errors`, the number of
errors are wrong. (See
#107).

Separately, this adds the missing expected-doctype-but-got-start-tag
error.
jgraham pushed a commit that referenced this issue Jul 5, 2021
If the `#errors` section should have the same number of lines as errors
(see #107), then the
NULL-character errors need to be accounted for.