This repository has been archived by the owner on Apr 14, 2019. It is now read-only.

Reports errors for unicode #11

Open
petdance opened this issue Dec 5, 2011 · 17 comments

petdance commented Dec 5, 2011

Reported by sciurus, Sep 26, 2008

Try to parse a string containing characters such as “directional quotation
marks”. HTML::Lint 2.04 reports invalid character errors.

Comment 1 by gavin.brock, Dec 8, 2009

To add to this, I think that if the file is declared as some flavor of Unicode (e.g. ), the "Invalid character \x65E5 should be written as" errors shouldn't be there at all.

Comment 2 by bishopw, Feb 13, 2011

I run into this when trying to run HTML::Lint on a set of pages with Japanese text.

If my name in Japanese, ビショップ, appears on the page, HTML::Lint tells me:

#  (122:21) Invalid character \x30D3 should be written as 
#  (122:21) Invalid character \x30B7 should be written as 
#  (122:21) Invalid character \x30E7 should be written as 
#  (122:21) Invalid character \x30C3 should be written as 
#  (122:21) Invalid character \x30D7 should be written as 

(The line ends where it looks like a different encoding suggestion should be.)

My pages do all include the meta tag declaring charset=utf-8.

Is it possible to increase the priority on this, since it makes HTML::Lint unusable for sites using utf-8, which is becoming the Internet default encoding?

@petdance petdance added the Bug label Dec 7, 2016

Ovid commented Dec 30, 2016

We're also getting hit by this for one of our projects. We're probably going to remove HTML::Lint support or locally fork it because this is a showstopper.

petdance (Owner, Author) commented

What would you do to fix it? You say you would fork it, so you must have something in mind that would alleviate the problem. Please let me know what that would be.

Ovid commented Dec 30, 2016

It's one of our devs who was considering that. If anything, we'd simply remove the check for our code since we declare our HTML to be valid UTF-8. I would check the declared charset and if the character is valid for that charset, ignore it.

A simpler (but less correct) fix is in HTML::Lint::Error. Here's the last line in the %errors hash:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', STRUCTURE],

That's what's causing the error. We thought that we could limit ourselves to STRUCTURE errors and be fine, but this is triggered incorrectly as a STRUCTURE error. I would probably change that to FLUFF.

petdance (Owner, Author) commented

I don't know squat about encodings and how to deal with them, so I could use some direction on this. I'd love to hear what your dev comes up with.

I'm also wondering if it might make sense to add the ability to exclude a given error, not just a class of errors.

Ovid commented Dec 30, 2016

Fair enough. Simply changing this:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', STRUCTURE],

To this:

'text-use-entity' => ['Character "${char}" should be written as ${entity}', FLUFF],

Would be enough for right now. I suspect many people already ignore the fluff as more and more tools are requiring new attributes to be added to tags and the check for known attributes doesn't seem to make a lot of sense now. Thus, "fluff" doesn't help for a modern web app.

However, as you pointed out, the ability to exclude a given error would be much more flexible.
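
As a rough illustration of per-error exclusion, here is a minimal, language-neutral sketch in Python. It is not HTML::Lint's API: the `LintError` record and `filter_errors` helper are hypothetical names, and only the `text-use-entity` code comes from this thread.

```python
from dataclasses import dataclass


@dataclass
class LintError:
    """Hypothetical stand-in for one reported lint error."""
    code: str      # error identifier, e.g. "text-use-entity"
    message: str   # human-readable text


def filter_errors(errors, excluded_codes):
    """Return the errors whose code is not in excluded_codes."""
    excluded = set(excluded_codes)
    return [err for err in errors if err.code not in excluded]


errors = [
    LintError("text-use-entity", "Invalid character \\x30D3 should be written as ..."),
    LintError("hypothetical-other-error", "Some unrelated problem"),
]
remaining = filter_errors(errors, ["text-use-entity"])
```

Filtering by individual code leaves the STRUCTURE/FLUFF classification untouched, so users who want the entity check keep it and everyone else can switch it off.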

petdance (Owner, Author) commented

Seems to me that encoding would be a STRUCTURE thing.

In the original post, it says "if the file is declared as some flavor of Unicode". How can I detect that?

Ovid commented Dec 30, 2016

Look for the charset meta tag. There's a short version: <meta charset="utf-8" /> and a longer version:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I would argue that encoding isn't really a structure thing because if the page contains 日本国, that's the content causing the error from HTML::Lint, not the structure containing the content. Plus, it's a false error since 日本国 should certainly be allowed on the page.
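
To make the meta-tag lookup concrete, here is a minimal sketch in Python using the standard library's `html.parser`. The `CharsetSniffer` class and `declared_charset` function are hypothetical names for illustration; the sketch handles only the two meta forms shown above and ignores HTTP headers and BOMs.

```python
import re
from html.parser import HTMLParser


class CharsetSniffer(HTMLParser):
    """Records the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if "charset" in attr_map:
            # Short form: <meta charset="utf-8" />
            self.charset = attr_map["charset"].strip().lower() or None
        elif attr_map.get("http-equiv", "").lower() == "content-type":
            # Long form: content="text/html; charset=utf-8"
            match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)",
                              attr_map.get("content", ""), re.I)
            if match:
                self.charset = match.group(1).lower()


def declared_charset(html_text):
    """Return the declared charset in lowercase, or None if none is declared."""
    sniffer = CharsetSniffer()
    sniffer.feed(html_text)
    return sniffer.charset
```

A linter could call something like `declared_charset(page)` once per document and skip the entity check when the result is a Unicode encoding.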

petdance (Owner, Author) commented

It's not a matter of "allowed on the page", but how it's represented.

The point of the rule is that if you have left curly double quotes, that should be &ldquo; instead of the character “.

Maybe it should be "If you have a character that can be represented as an HTML entity but isn't"?
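
A rule phrased that way could look roughly like the following Python sketch. It is an assumption about how such a check might work, not HTML::Lint's implementation; `entity_candidates` is a hypothetical name, and the entity table is Python's `html.entities.codepoint2name` (the HTML 4 named entities).

```python
from html.entities import codepoint2name


def entity_candidates(text):
    """List (character, entity) pairs for non-ASCII characters that have a named entity."""
    found = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Skip ASCII so that markup characters such as & and < are not reported here.
        if ord(ch) > 127 and name is not None:
            found.append((ch, f"&{name};"))
    return found
```

Under this rule, curly quotes and ™ would still be reported, while Japanese text, which has no named entities, would not.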

Ovid commented Dec 30, 2016

Then again, if you want to check the charset and do it correctly, you have to start considering things like whether or not the BOM is present: it shouldn't be for UTF-8 (but is allowed), mustn't be for ISO-8859-1, and is required for UTF-16. That quickly gets tedious, I imagine.
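
For a sense of what the BOM part of that check involves, here is a small Python sketch built on the BOM constants in the standard `codecs` module. The `sniff_bom` name is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Byte order marks checked in order; UTF-32 is intentionally left out of this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw):
    """Return the encoding implied by a leading BOM in raw bytes, or None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

The result would then have to be reconciled with the declared charset, which is where the tedium comes in.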

Ovid commented Dec 30, 2016

"If you have a character that can be represented as an HTML entity but isn't"

That would not work for us. For example, we're writing a game and one of the items in the game is named Reflectrix™. That (\x2122) symbol is perfectly valid, but you can drop in &trade;. The &trade; offers no benefit at all. In fact, many of the old HTML entities offer no value in a modern Web environment, but offer negative value if you have to spend time trying to encode them.

petdance (Owner, Author) commented

It's valid for my job, because we want to encode everything we can. We get stuff from other departments that is just cut & pasted from Word, and we want to make sure we know what we're putting out there.

But HTML::Lint was created back in 2005, and it sounds like in modern web dev just dropping in \x2122 is common.

Ovid commented Dec 30, 2016

Then offering a way to exclude errors that are useful for some, but detrimental to others, sounds like the best compromise.

petdance (Owner, Author) commented

Agreed. I created a ticket at #54.

Ovid commented Dec 30, 2016

IO::HTML will determine the encoding for you. With that, it should be easier to figure out if a character is valid or not. If it's declared as ISO-8859-1, then yes, you want HTML encoding. Otherwise, you probably won't need it (so long as you have valid byte sequences, but even then you'd have to check that the encoded byte sequences are valid for the declared encoding).
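
The idea of only asking for entities when the declared encoding cannot represent a character can be sketched as follows. This is Python rather than Perl, it does not use IO::HTML, and `chars_needing_entities` is a hypothetical name; it assumes the text is already decoded and the charset name is one Python's codec registry knows.

```python
def chars_needing_entities(text, charset):
    """Return the characters in text that the given charset cannot encode directly."""
    unrepresentable = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            # The declared encoding has no byte sequence for this character,
            # so it would have to be written as an entity or numeric reference.
            unrepresentable.append(ch)
    return unrepresentable
```

With a page declared as UTF-8 this returns nothing for 日本国, while the same text under ISO-8859-1 returns all three characters. An unknown charset name raises `LookupError`, which a real tool would need to handle.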

petdance (Owner, Author) commented

@Ovid I'm willing to add the error exclusions feature in #54, but want to make sure you're going to use it. If you're planning to bail on HTML::Lint anyway, then I won't bother putting the time in. But if it will make things easier for you, I'll be glad to do it.

Ovid commented Dec 31, 2016

@petdance: yes, if this change is available, we'll be able to keep using this module. Thank you for your help.

robrwo commented Jan 8, 2018

I agree with @Ovid here.

Currently we're using HTML::Lint::Pluggable to override/ignore certain error messages in the meantime.
