Reports errors for unicode #11

Open
petdance opened this Issue Dec 5, 2011 · 17 comments

3 participants
@petdance
Owner

petdance commented Dec 5, 2011

Reported by sciurus, Sep 26, 2008

Try to parse a string containing characters such as “directional quotation
marks”. HTML::Lint 2.04 reports invalid character errors.

Comment 1 by gavin.brock, Dec 8, 2009

To add to this: if the file is declared as some flavor of Unicode (e.g. ), I think the "Invalid character \x65E5 should be written as" errors shouldn't be there at all.

Comment 2 by bishopw, Feb 13, 2011

I run into this when trying to run HTML::Lint on a set of pages with Japanese text.

If my name in Japanese, ビショップ, appears on the page, HTML::Lint tells me:

#  (122:21) Invalid character \x30D3 should be written as 
#  (122:21) Invalid character \x30B7 should be written as 
#  (122:21) Invalid character \x30E7 should be written as 
#  (122:21) Invalid character \x30C3 should be written as 
#  (122:21) Invalid character \x30D7 should be written as 

(The line ends where it looks like a different encoding suggestion should be.)

My pages do all include the meta tag declaring charset=utf-8.

Is it possible to increase the priority on this, since it makes HTML::Lint unusable for sites using utf-8, which is becoming the Internet default encoding?
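For reference, the five code points in the errors above are simply the five characters of the name. A minimal Perl sketch that shows the mapping; the helper name `code_points` is made up for illustration and is not part of HTML::Lint:

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal UTF-8 text

# Hypothetical helper: list each character's code point in the same
# \xNNNN form that the error messages use.
sub code_points {
    my ($text) = @_;
    return map { sprintf '\x%04X', ord($_) } split //, $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for code_points('ビショップ');
```

This prints \x30D3, \x30B7, \x30E7, \x30C3 and \x30D7, one per line, which are the values in the report.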

@petdance petdance added the Bug label Dec 7, 2016

Ovid commented Dec 30, 2016

We're also getting hit by this for one of our projects. We're probably going to remove HTML::Lint support or locally fork it because this is a showstopper.

petdance commented Dec 30, 2016

What would you do to fix it? You say you would fork it, so you must have something in mind that would alleviate the problem. Please let me know what that would be.

Ovid commented Dec 30, 2016

It's one of our devs who was considering that. If anything, we'd simply remove the check for our code since we declare our HTML to be valid UTF-8. I would check the declared charset and if the character is valid for that charset, ignore it.

A simpler (but less correct) fix is in HTML::Lint::Error. Here's the last line in the %errors hash:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', STRUCTURE],

That's what's causing the error. We thought that we could limit ourselves to STRUCTURE errors and be fine, but this is triggered incorrectly as a STRUCTURE error. I would probably change that to FLUFF.

petdance commented Dec 30, 2016

I don't know squat about encodings and how to deal with them, so I could use some direction on this. I'd love to hear what your dev comes up with.

I'm also wondering if it might make sense to add the ability to exclude a given error, not just a class of errors.

Ovid commented Dec 30, 2016

Fair enough. Simply changing this:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', STRUCTURE],

To this:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', FLUFF],

Would be enough for right now. I suspect many people already ignore the fluff as more and more tools are requiring new attributes to be added to tags and the check for known attributes doesn't seem to make a lot of sense now. Thus, "fluff" doesn't help for a modern web app.

However, as you pointed out, the ability to exclude a given error would be much more flexible.
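Until a built-in exclusion exists, one possible stopgap is to filter the error list after linting. The sketch below assumes only that each error object has an `errcode` method, which HTML::Lint::Error provides; `without_codes` is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Lint: drop every error whose
# code appears in the skip list. It works on any objects that have an
# errcode() method, as HTML::Lint::Error objects do.
sub without_codes {
    my ( $skip, @errors ) = @_;
    my %skip = map { $_ => 1 } @{$skip};
    return grep { !$skip{ $_->errcode } } @errors;
}

# Usage, assuming HTML::Lint is installed and $html holds the page:
#   my $lint = HTML::Lint->new;
#   $lint->parse($html);
#   $lint->eof;
#   print $_->as_string, "\n"
#       for without_codes( ['text-use-entity'], $lint->errors );
```

Filtering on the error code rather than the error type leaves the STRUCTURE/FLUFF classification untouched, so nobody else's setup changes.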

petdance commented Dec 30, 2016

Seems to me that encoding would be a STRUCTURE thing.

In the original post, it says "if the file is declared as some flavor of Unicode". How can I detect that?

Ovid commented Dec 30, 2016

Look for the charset meta tag. There's a short version: <meta charset="utf-8" /> and a longer version:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I would argue that encoding isn't really a structure thing because if the page contains 日本国, that's the content causing the error from HTML::Lint, not the structure containing the content. Plus, it's a false error since 日本国 should certainly be allowed on the page.
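As a rough illustration of spotting either form of the tag, here is a regex-based sketch in core Perl. `declared_charset` is a hypothetical helper; it is not a real HTML parser and it ignores BOMs and HTTP headers, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset named in the first <meta> tag
# that declares one (lowercased), or undef if none is found.
sub declared_charset {
    my ($html) = @_;
    while ( $html =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        # Matches both charset="utf-8" and content="...; charset=utf-8".
        if ( $attrs =~ /\bcharset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i ) {
            return lc($1);
        }
    }
    return;
}

print declared_charset('<meta charset="utf-8" />'), "\n";
print declared_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
), "\n";
```

Both calls print utf-8.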

petdance commented Dec 30, 2016

It's not a matter of "allowed on the page", but how it's represented.

The point of the rule is that if you have left curly double quotes, that should be &ldquo; instead of the character “.

Maybe it should be "If you have a character that can be represented as an HTML entity but isn't"?

Ovid commented Dec 30, 2016

Then again, if you want to check the charset and do it correctly, you have to start considering things like whether or not the BOM is present (it shouldn't be for UTF-8, though it is allowed; it mustn't be for ISO-8859-1; it's required for UTF-16). That quickly gets tedious, I imagine.
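The BOM check itself is small. A minimal core-Perl sketch, assuming the input is a raw byte string; `bom_encoding` is a hypothetical helper, and it deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes as UTF-16LE.

```perl
use strict;
use warnings;

# Hypothetical helper: name the encoding implied by a byte order mark at
# the start of a raw byte string, or return undef if there is no BOM.
# UTF-32 BOMs are not handled in this sketch.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my $page = "\xEF\xBB\xBF<html></html>";
print bom_encoding($page) // 'no BOM', "\n";
```

The tedious part is everything around it: reconciling the BOM with the meta tag and any HTTP header when they disagree.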

Ovid commented Dec 30, 2016

"If you have a character that can be represented as an HTML entity but isn't"

That would not work for us. For example, we're writing a game and one of the items in the game is named Reflectrix™. That (\x2122) symbol is perfectly valid, but you can drop in &trade;. The &trade; offers no benefit at all. In fact, many of the old HTML entities offer no value in a modern Web environment, but offer negative value if you have to spend time trying to encode them.

petdance commented Dec 30, 2016

It's valid for my job, because we want to encode everything we can. We get stuff from other departments that is just cut & pasted from Word, and we want to make sure we know what we're putting out there.

But HTML::Lint was created back in 2005, and it sounds like in modern web dev, dropping in a raw \x2122 like that is common.

Ovid commented Dec 30, 2016

Then offering a way to exclude errors that are useful for some, but detrimental to others, sounds like the best compromise.

petdance commented Dec 30, 2016

Agreed. I created a ticket at #54.

Ovid commented Dec 30, 2016

IO::HTML will determine the encoding for you. With that, it should be easier to figure out if a character is valid or not. If it's declared as ISO-8859-1, then yes, you want HTML encoding. Otherwise, you probably won't need it (so long as you have valid byte sequences, but even then you'd have to check that the encoded byte sequences are valid for the declared encoding).
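A minimal sketch of that idea, assuming IO::HTML is installed from CPAN and using its `find_charset_in` function on an in-memory string. The rule for when entity encoding is wanted is an assumption made for illustration, not something HTML::Lint does.

```perl
use strict;
use warnings;
use IO::HTML qw(find_charset_in);    # CPAN module, not in core Perl

# Raw bytes of a page; in real use these would be read from a file
# opened with the ':raw' layer.
my $html = '<html><head><meta charset="utf-8"></head><body></body></html>';

# find_charset_in() returns Perl's name for the declared encoding,
# or undef when no recognized charset is declared.
my $encoding = find_charset_in($html);

# Assumption for this sketch: only ask for entity encoding when the
# page is not declared as some form of UTF.
my $wants_entities = !defined $encoding || $encoding !~ /^utf/i;

printf "declared encoding: %s, entity check wanted: %s\n",
    $encoding // 'none', $wants_entities ? 'yes' : 'no';
```

Note that the name returned is Perl's canonical encoding name, which may differ from the charset label written in the page.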

petdance commented Dec 30, 2016

@Ovid I'm willing to add the error exclusions feature in #54, but want to make sure you're going to use it. If you're planning to bail on HTML::Lint anyway, then I won't bother putting the time in. But if it will make things easier for you, I'll be glad to do it.

Ovid commented Dec 31, 2016

@petdance: yes, if this change is available, we'll be able to keep using this module. Thank you for your help.

robrwo commented Jan 8, 2018

I agree with @Ovid here.

Currently we're using HTML::Lint::Pluggable to override/ignore certain error messages in the meantime.
