Reports errors for unicode #11

petdance · 2011-12-05T23:03:27Z

Reported by sciurus, Sep 26, 2008

Try to parse a string containing characters such as “directional quotation
marks”. HTML::Lint 2.04 reports invalid character errors.

Comment 1 by gavin.brock, Dec 8, 2009

To add to this, I think that if the file is declared as some flavor of Unicode (e.g. ), the "Invalid character \x65E5 should be written as" errors shouldn't be there at all.

Comment 2 by bishopw, Feb 13, 2011

I run into this when trying to run HTML::Lint on a set of pages with Japanese text.

If my name in Japanese, ビショップ, appears on the page, HTML::Lint tells me:

#  (122:21) Invalid character \x30D3 should be written as 
#  (122:21) Invalid character \x30B7 should be written as 
#  (122:21) Invalid character \x30E7 should be written as 
#  (122:21) Invalid character \x30C3 should be written as 
#  (122:21) Invalid character \x30D7 should be written as

(The line ends where it looks like a different encoding suggestion should be.)

My pages do all include the meta tag declaring charset=utf-8.

Is it possible to increase the priority on this, since it makes HTML::Lint unusable for sites using utf-8, which is becoming the Internet default encoding?


Ovid · 2016-12-30T11:14:06Z

We're also getting hit by this for one of our projects. We're probably going to remove HTML::Lint support or locally fork it because this is a showstopper.

petdance · 2016-12-30T14:54:41Z

What would you do to fix it? You say you would fork it, so you must have something in mind that would alleviate the problem. Please let me know what that would be.

Ovid · 2016-12-30T15:33:24Z

It's one of our devs who was considering that. If anything, we'd simply remove the check for our code since we declare our HTML to be valid UTF-8. I would check the declared charset and if the character is valid for that charset, ignore it.

A simpler (but less correct) fix is in HTML::Lint::Error. Here's the last line in the %errors hash:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', STRUCTURE],

That's what's causing the error. We thought that we could limit ourselves to STRUCTURE errors and be fine, but this is triggered incorrectly as a STRUCTURE error. I would probably change that to FLUFF.

petdance · 2016-12-30T15:35:45Z

I don't know squat about encodings and how to deal with them, so I could use some direction on this. I'd love to hear what your dev comes up with.

I'm also wondering if it might make sense to add the ability to exclude a given error, not just a class of errors.

Ovid · 2016-12-30T15:39:53Z

Fair enough. Simply changing this:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', STRUCTURE],

To this:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', FLUFF],

Would be enough for right now. I suspect many people already ignore the fluff as more and more tools are requiring new attributes to be added to tags and the check for known attributes doesn't seem to make a lot of sense now. Thus, "fluff" doesn't help for a modern web app.

However, as you pointed out, the ability to exclude a given error would be much more flexible.

petdance · 2016-12-30T15:53:03Z

Seems to me that encoding would be a STRUCTURE thing.

In the original post, it says "if the file is declared as some flavor of Unicode". How can I detect that?

Ovid · 2016-12-30T16:08:05Z

Look for the charset meta tag. There's a short version: <meta charset="utf-8" /> and a longer version:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I would argue that encoding isn't really a structure thing because if the page contains 日本国, that's the content causing the error from HTML::Lint, not the structure containing the content. Plus, it's a false error since 日本国 should certainly be allowed on the page.
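
A minimal sketch of that lookup, assuming the raw HTML is already in a string. `declared_charset` is a hypothetical helper, not part of HTML::Lint, and a regex like this only approximates what a real HTML parser would find:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Lint): return the charset named in
# a <meta> tag, lowercased, or nothing if no charset is declared.
sub declared_charset {
    my ($html) = @_;

    # Walk the <meta ...> tags; both the short and the long form carry a
    # "charset=" token somewhere inside the tag.
    while ( $html =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        if ( $attrs =~ /\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i ) {
            return lc $1;
        }
    }
    return;
}

print declared_charset('<meta charset="utf-8" />'), "\n";
print declared_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
), "\n";
```

Both calls should report utf-8. A page with no declaration returns nothing, in which case the caller would have to fall back on some default.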

petdance · 2016-12-30T16:10:48Z

It's not a matter of "allowed on the page", but how it's represented.

The point of the rule is that if you have left curly double quotes, that should be &ldquo; instead of the character “.

Maybe it should be "If you have a character that can be represented as an HTML entity but isn't"?

Ovid · 2016-12-30T16:12:35Z

Then again, if you want to check the charset and do it correctly, you have to start considering things like whether or not the BOM is present (shouldn't be for UTF-8, though it is allowed; mustn't be for ISO-8859-1; required for UTF-16). That quickly gets tedious, I imagine.
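
For the BOM part alone, a rough sketch might look like the following. `bom_encoding` is a hypothetical name, it expects raw bytes rather than decoded text, and it deliberately ignores UTF-32 (whose little-endian BOM starts with the same two bytes as UTF-16LE):

```perl
use strict;
use warnings;

# Hypothetical helper: name the encoding implied by a byte-order mark at the
# start of a raw byte string, or return nothing when there is no BOM.
# UTF-32 is not handled here.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

print bom_encoding("\xEF\xBB\xBF<html>") // 'none', "\n";
print bom_encoding("<html>")             // 'none', "\n";
```

A linter would still have to reconcile this answer with the charset declared in the markup, which is where the tedium comes in.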

Ovid · 2016-12-30T16:15:35Z

"If you have a character that can be represented as an HTML entity but isn't"

That would not work for us. For example, we're writing a game and one of the items in the game is named Reflectrix™. That ™ (\x2122) symbol is perfectly valid, but you can drop in &trade;. The &trade; offers no benefit at all. In fact, many of the old HTML entities offer no value in a modern Web environment, but offer negative value if you have to spend time trying to encode them.

petdance · 2016-12-30T16:17:59Z

It's valid to my job, because we want to encode everything we can. We get stuff from other departments that are just cut & pasted from Word and want to make sure we know what we're putting out there.

But HTML::Lint was created back in 2005, and it sounds like I'm hearing that in modern web dev, dropping in a raw \x2122 like that is common.

Ovid · 2016-12-30T16:21:01Z

Then offering a way to exclude errors that are useful for some, but detrimental to others, sounds like the best compromise.
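
To make the idea concrete, here is a sketch of per-error exclusion over plain hashes. Everything in it is an assumption for illustration: real HTML::Lint errors are objects, and `exclude_errors` and the record layout are hypothetical, not an existing or planned API:

```perl
use strict;
use warnings;

# Stand-in error records; this hash layout is assumed only for the example.
my @errors = (
    { code => 'text-use-entity', message => 'Character "\x2122" should be written as &trade;' },
    { code => 'elem-unclosed',   message => '<b> at (3:1) is never closed' },
);

# Hypothetical filter: drop every error whose code is in the exclusion list.
sub exclude_errors {
    my ( $errors, @excluded ) = @_;
    my %skip = map { $_ => 1 } @excluded;
    return grep { !$skip{ $_->{code} } } @$errors;
}

my @kept = exclude_errors( \@errors, 'text-use-entity' );
print "$_->{code}: $_->{message}\n" for @kept;
```

Filtering by code rather than by type leaves the STRUCTURE and FLUFF classification untouched, so people who want the entity check keep it.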

petdance · 2016-12-30T16:26:01Z

Agreed. I created a ticket at #54.

Ovid · 2016-12-30T16:26:43Z

IO::HTML will determine the encoding for you. With that, it should be easier to figure out if a character is valid or not. If it's declared as ISO-8859-1, then yes, you want HTML encoding. Otherwise, you probably won't need it (so long as you have valid byte sequences, but even then you'd have to check that the encoded byte sequences are valid for the declared encoding).
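
Once the encoding is known, one possible test is whether the character can be encoded in it at all. The sketch below uses the core Encode module; `needs_entity` is a hypothetical helper, and it takes the encoding name as given rather than detecting it:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true when $char has no representation in $encoding,
# so an entity or numeric reference is the only way to emit it.
sub needs_entity {
    my ( $char, $encoding ) = @_;
    my $copy = $char;    # encode() may modify its input when CHECK is set
    return !eval { Encode::encode( $encoding, $copy, Encode::FB_CROAK ); 1 };
}

my $tm = "\x{2122}";     # TRADE MARK SIGN
print 'ISO-8859-1: ', ( needs_entity( $tm, 'ISO-8859-1' ) ? 'entity' : 'literal' ), "\n";
print 'UTF-8: ',      ( needs_entity( $tm, 'UTF-8' )      ? 'entity' : 'literal' ), "\n";
```

With ISO-8859-1 the trademark sign cannot be encoded, so this reports "entity"; with UTF-8 it reports "literal".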

petdance · 2016-12-30T22:44:02Z

@Ovid I'm willing to add the error exclusions feature in #54, but want to make sure you're going to use it. If you're planning to bail on HTML::Lint anyway, then I won't bother putting the time in. But if it will make things easier for you, I'll be glad to do it.

Ovid · 2016-12-31T12:03:42Z

@petdance: yes, if this change is available, we'll be able to keep using this module. Thank you for your help.

robrwo · 2018-01-08T11:37:51Z

I agree with @Ovid here.

Currently we're using HTML::Lint::Pluggable to override/ignore certain error messages in the meantime.

petdance added the Bug label Dec 7, 2016
