encoding issue with gb2312 data #87

Closed
axtens opened this Issue May 21, 2013 · 7 comments

axtens commented May 21, 2013

Context: Microsoft JScript on Windows Server 2008 R2 64bit

var url = "http://www.google.com.hk/search?q=pennytel%20downloads&sa=%20%CB%D1%20%CB%F7%20&forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
var x = new URI(url);
var rMap = x.search(true);

When the .search is executed I get

Microsoft JScript runtime error: The URI to be decoded is not a valid encoding

The break occurs here

d.decodeQuery = function (a) {
    return d.decode((a + "").replace(/\+/g, "%20"))
};

and it's probably complaining about the "sa=%20%CB%D1%20%CB%F7%20". What's amiss here? Is it fixable? Is it an encoding issue or something else?

Member

rodneyrehm commented May 21, 2013

I can reproduce the issue in Firefox 21 on Mac. The problem is the sequence %CB%D1: it can't be decoded by decodeURIComponent().

decodeURIComponent() expects UTF-8 escape sequences and throws if the input is not valid UTF-8. With unescape(), the sequence resolves to ËÑ, which would be percent-encoded in UTF-8 as %C3%8B%C3%91.

Can you check which characters this sequence should resolve to? Can you make sure that the data is UTF-8?
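
The difference between the two decoders can be reproduced without URI.js. The following is a minimal sketch using only built-in functions; the variable names are illustrative and not taken from the library:

```javascript
// The byte sequence from the failing query parameter.
const sequence = "%CB%D1";

// decodeURIComponent() treats the escaped bytes as UTF-8. 0xCB starts a
// two-byte sequence, but 0xD1 is not a valid continuation byte, so it throws.
let utf8Result;
try {
  utf8Result = decodeURIComponent(sequence);
} catch (err) {
  utf8Result = err.name;
}

// unescape() maps each escaped byte to the code point with the same value
// (effectively ISO 8859-1), so it never throws on %XX input.
const latin1Result = unescape(sequence);

// Re-encoding those two characters as UTF-8 gives a different escape sequence.
const reencoded = encodeURIComponent(latin1Result);

console.log(utf8Result);   // URIError
console.log(latin1Result); // ËÑ
console.log(reencoded);    // %C3%8B%C3%91
```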

axtens commented May 22, 2013

As far as I can tell, given the ie and oe parameters (&ie=GB2312&oe=GB2312), the characters are GB2312-encoded Chinese characters. If I store ËÑË÷ in a text file and, using BabelPad, read them in as GB2312, I get 脣脩脣梅. That expressed as UTF-8 is, in hex, E8 84 A3 E8 84 A9 E8 84 A3 E6 A2 85.

How to deal with this is tricky, because the original URL came into our website via Google Hong Kong, so we have no way of controlling how the data is encoded. Do I change URI.js to use unescape()? At the moment, I run every URL through unescape() anyway so that URI.js doesn't crash on the weird ones.
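
One way to check what the escaped bytes represent is to decode them directly as GB2312 instead of going through the ISO 8859-1 characters. The sketch below assumes a modern runtime whose TextDecoder accepts the "gb2312" label (current browsers, or Node.js built with full ICU); it would not run in the JScript environment from the report. The helper names percentDecodeToBytes and decodeWithCharset are hypothetical and not part of URI.js:

```javascript
// Turn a percent-encoded string into raw bytes without assuming a charset.
// Unescaped characters are assumed to be ASCII.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    const hex = encoded.slice(i + 1, i + 3);
    if (encoded[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Decode the raw bytes with an explicitly named charset, for example the
// value of the "ie" query parameter.
function decodeWithCharset(encoded, charset) {
  return new TextDecoder(charset).decode(percentDecodeToBytes(encoded));
}

const decodedSa = decodeWithCharset("%20%CB%D1%20%CB%F7%20", "gb2312");
console.log(JSON.stringify(decodedSa));
```

Where the decoder is available, this should print " 搜 索 " (the word for "search", padded with spaces), which suggests the sa value is the label of the search button.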

Member

rodneyrehm commented May 22, 2013

Well, URI.js supports both UTF-8 and ISO 8859 modes. You could easily wrap things:

URI.prototype.getQueryParameters = function() {
  var uri = URI(this.search());
  try {
    return uri.search(true);
  } catch(e) {
    return uri.unicode().search(true);
  }
};

yielding: URI('?a=%CB%D1').getQueryParameters() returns { a: "ËÑ" }

I'm not sure if I'd want this to happen automatically, internally, without the implementor even noticing…

Member

rodneyrehm commented May 27, 2013

See #92 as well

axtens commented Jun 28, 2013

This issue's popped up again and I'm trying to figure out how to get around it.
The URL in this case is
var url = "http://www.google.com.hk/search?q=pennytel downloads&sa= %CB%D1 %CB%F7 &forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
and the code which is breaking (with the same error and error-location as above)

var uri = new URI(url); 
//...
var uQuery = uri.clone().setQuery("");

It's the setQuery that's failing. How do I set my query to nothing without using setQuery()?

axtens commented Jun 28, 2013

Ok, simple answer: var uQuery = uri.clone().search("");

Member

rodneyrehm commented Aug 3, 2013

I've fixed this in master; it will be included in the next release. Thank you for your help!

Query string data that cannot be decoded will now simply be returned undecoded. That way any decodable data can still be of use.
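
The behaviour described above can be approximated outside the library. This is a minimal sketch of the idea, not the actual URI.js implementation; decodeQueryValue and parseQuery are hypothetical names:

```javascript
// Decode one query name or value; if it is not valid UTF-8, hand back the
// raw text instead of throwing.
function decodeQueryValue(raw) {
  const spaced = String(raw).replace(/\+/g, "%20");
  try {
    return decodeURIComponent(spaced);
  } catch (err) {
    if (err instanceof URIError) {
      return raw; // undecodable: return unchanged
    }
    throw err;
  }
}

// Split a query string into a plain object, decoding each part independently
// so one bad value does not prevent the others from being read.
function parseQuery(query) {
  const result = {};
  for (const pair of query.replace(/^\?/, "").split("&")) {
    if (!pair) continue;
    const idx = pair.indexOf("=");
    const name = idx === -1 ? pair : pair.slice(0, idx);
    const value = idx === -1 ? null : pair.slice(idx + 1);
    result[decodeQueryValue(name)] = value === null ? null : decodeQueryValue(value);
  }
  return result;
}

const params = parseQuery("?q=pennytel%20downloads&sa=%20%CB%D1%20%CB%F7%20&ie=GB2312");
console.log(params.q);  // pennytel downloads
console.log(params.sa); // %20%CB%D1%20%CB%F7%20
console.log(params.ie); // GB2312
```

Decoding each value separately means the caller can still read q and ie, and can inspect ie to decide how to handle the raw sa value.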

rodneyrehm closed this on Aug 3, 2013
