encoding issue with gb2312 data #87

Closed
axtens opened this Issue May 21, 2013 · 7 comments

axtens commented May 21, 2013 edited by rodneyrehm

Context: Microsoft JScript on Windows Server 2008 R2 64bit

var url = "http://www.google.com.hk/search?q=pennytel%20downloads&sa=%20%CB%D1%20%CB%F7%20&forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
var x = new URI(url);
var rMap = x.search(true);

When the .search is executed I get

Microsoft JScript runtime error: The URI to be decoded is not a valid encoding

The break occurs here

d.decodeQuery = function (a) {
    return d.decode((a + "").replace(/\+/g, "%20"))
};

and it's probably complaining about the "sa=%20%CB%D1%20%CB%F7%20". What's amiss here? Is it fixable? Is it an encoding issue or something else?

Owner

I can reproduce the issue in Firefox 21 on Mac. The problem is the sequence %CB%D1, which can't be decoded by decodeURIComponent().

decodeURIComponent() expects UTF-8 escape sequences and fails if it can't resolve the input. Using unescape() the sequence resolves to ËÑ, which would properly be percent-encoded as %C3%8B%C3%91
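A minimal sketch of the behaviour described above, using only built-in functions. The variable names are illustrative and not part of URI.js.

```javascript
// The bytes CB D1 do not form a valid UTF-8 sequence:
// 0xCB announces a two-byte character, but 0xD1 is not a continuation byte.
var raw = "%CB%D1";

var utf8Error = null;
try {
  decodeURIComponent(raw);
} catch (e) {
  utf8Error = e; // a URIError; the message text differs between engines
}

// unescape() maps every %XX to the single code point U+00XX (ISO 8859-1 style).
var latin1 = unescape(raw); // "ËÑ"

// Encoding that string again as UTF-8 gives a different, longer sequence.
var reencoded = encodeURIComponent(latin1); // "%C3%8B%C3%91"
```

Note that unescape() only avoids the exception. It does not know the bytes were meant as anything other than ISO 8859-1.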

Can you check what characters this sequence should resolve to? Can you make sure that the data is UTF-8?

axtens commented May 22, 2013 edited by rodneyrehm

As far as I can tell, given the ie and oe variables (&ie=GB2312&oe=GB2312), the characters are GB2312-encoded Chinese characters. If I store ËÑË÷ in a text file and, using BabelPad, read them in as GB2312, I get 脣脩脣梅. That expressed as UTF-8 is, in hex, E8 84 A3 E8 84 A9 E8 84 A3 E6 A2 85.

Now, how to deal with this is tricky because the original url has come into our website via Google Hong Kong so we have no way of controlling how the data is encoded. Do I change URI.js to use unescape? At the moment, I run every url through unescape() anyway so that URI.js doesn't crash on the weird ones.
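For comparison, here is a hedged sketch that decodes the four raw bytes CB D1 CB F7 directly as GB2312, without first saving the unescape() output as text. It assumes an environment that provides TextDecoder with legacy encodings (modern browsers, or Node.js built with full ICU), so it would not run in classic JScript. percentToBytes is a hypothetical helper, not part of URI.js.

```javascript
// Hypothetical helper: convert a percent-encoded string into raw bytes.
// Assumes every character outside a %XX escape is plain ASCII.
function percentToBytes(s) {
  var bytes = [];
  for (var i = 0; i < s.length; i++) {
    var hex = s.substr(i + 1, 2);
    if (s.charAt(i) === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(s.charCodeAt(i) & 0xff);
    }
  }
  return new Uint8Array(bytes);
}

var gbBytes = percentToBytes("%CB%D1%CB%F7");

// "gb2312" is a standard encoding label; it throws a RangeError if the
// runtime was built without support for legacy encodings.
var gbText = new TextDecoder("gb2312").decode(gbBytes);
```

Decoded this way, the bytes should come out as 搜索 ("search"), which would fit the sa parameter carrying the label of the search button.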

Owner

Well, URI.js supports both a UTF-8 and an ISO 8859 mode. You could easily wrap things:

URI.prototype.getQueryParameters = function() {
  var uri = URI(this.search());
  try {
    return uri.search(true);
  } catch(e) {
    return uri.unicode().search(true);
  }
};

yielding: URI('?a=%CB%D1').getQueryParameters() returns { a: "ËÑ" }

I'm not sure if I'd want this to happen automatically, internally, without the implementor even noticing…

Owner

See #92 as well

axtens commented Jun 28, 2013 edited by rodneyrehm

This issue's popped up again and I'm trying to figure out how to get around it.
The URL in this case is:
var url = "http://www.google.com.hk/search?q=pennytel downloads&sa= %CB%D1 %CB%F7 &forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
and the code that breaks (with the same error and error location as above) is:

var uri = new URI(url); 
//...
var uQuery = uri.clone().setQuery("");

It's the setQuery that's failing. How do I set my query to nothing without using setQuery()?

axtens commented Jun 28, 2013

Ok, simple answer: var uQuery = uri.clone().search("");

Owner

I've fixed this in master. It will be included in the next release. Thank you for your help!

QueryString data that cannot be decoded will now simply be returned undecoded - that way any decodable data can still be of use.
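The idea of returning undecodable data as-is can be sketched roughly as follows. This is an illustration only and not the actual URI.js code. safeDecode and parseQuery are hypothetical names.

```javascript
// Hypothetical helper: decode one query component, falling back to the
// original text when it is not valid UTF-8 percent-encoding.
function safeDecode(value) {
  try {
    return decodeURIComponent(String(value).replace(/\+/g, "%20"));
  } catch (e) {
    return String(value); // leave undecodable data untouched
  }
}

// Hypothetical helper: parse "?a=1&b=2" into an object.
// Repeated names simply overwrite each other in this sketch.
function parseQuery(query) {
  var result = {};
  var pairs = String(query).replace(/^\?/, "").split("&");
  for (var i = 0; i < pairs.length; i++) {
    if (!pairs[i]) continue; // skip empty segments such as "&&"
    var idx = pairs[i].indexOf("=");
    var name = idx === -1 ? pairs[i] : pairs[i].slice(0, idx);
    var value = idx === -1 ? null : safeDecode(pairs[i].slice(idx + 1));
    result[safeDecode(name)] = value;
  }
  return result;
}

var parsed = parseQuery("?q=pennytel%20downloads&sa=%20%CB%D1%20&hl=zh-CN");
// parsed.q  is "pennytel downloads"
// parsed.sa is "%20%CB%D1%20" (left undecoded)
```

Decoding each component separately means one bad parameter does not prevent the others from being read.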

@rodneyrehm rodneyrehm closed this Aug 3, 2013