Unescaping URI doesn't recreate unicode characters #86

hesselink · 2013-02-06T12:17:51Z

Since 23c1007 (see #51), unicode characters in strings are escaped the way browsers do: encode as UTF8, and use the bytes from that to percent-encode. However, during unescaping no UTF8-decoding is done. This means that (decode . encode) is not the identity, and results in malformed strings. For example, if I take the string from test case testEscapeURIString06 and unescape it, I get:

> putStrLn $ unEscapeString (escapeURIString isUnescapedInURIComponent "helloø©日本")
helloÃ¸Â©æ�¥æ�¬

The text was updated successfully, but these errors were encountered:

tibbe · 2013-02-06T18:20:40Z

@singpolyma Could you please take a look?

stepcut · 2013-03-01T23:00:14Z

Next time you completely change the behavior of a function like unEscapeString completely breaking any code that requires the behavior that has been present for a decade, it would be nice if you didn't give the package the tiniest version number bump possible.

tibbe · 2013-03-01T23:24:09Z

My apologies. Do we need a new point release to reinstate the old behavior (and make a new release the introduces the new behavior under a major version bump).

stepcut · 2013-03-02T00:34:28Z

I'm not sure. Releasing a new point release might be nice. People who have not yet upgraded to network 2.4.1.{1|2} would then get a working version like 2.4.1.3 if they got updated to the lastest 2.4.* network. People who have the 'broken' versions wouldn't see the benefit until they get upgraded of course.

For my particular needs.. I am just going to update happstack-server and web-routes and release new versions. I will probably just stop using the unEscapeString function so that I can allow a wider range of versions for network and not have to worry about the changing unEscapeString behavior. Once I do that, it won't really make any difference to me whether you release a 2.4.1.3 or not.

Looking at the Haskell community as a whole.. there are two types of programs out there.

programs that worked correctly because they understood the old behavior and dealt with it.
programs that worked incorrectly before, but are now fixed by this.

Anything in #1 is probably now broken.

I am also a bit wishy-washy about whether the new behavior is entirely correct. I guess the old behavior was just 'percentDecode' and the new behavior is 'percentDecodeAndUtf8Decode'. For URIs you pretty much always want the latter. But, in theory, you might have been using 'unEscapeString' to handle something like this:

Content-Type: application/x-www-form-urlencoded; charset=binary

Though, that is unlike to come up in practice I think. I'm guessing most browsers ignore the charset=binary in a form tag. I think I've experimented with that before.

Ultimately, you have the problem of notifying developers that something significant changed in unEscapeString. I think Gregory Collins and I (or perhaps it was mightybyte) independently came to the conclusion that the only safe way to do it is to create a new function with a new name for the new behavior, and then mark the old version as deprecated. This will at least force developers to think for a minute and maybe read some documentation about what they need to do to change their code for the new function. As it stands now, even if you had waited until network 2.5 to make the changes, most people would have just checked that their code compiled and assumed everything was A-OK. The downside is that 'unEscapeString' is a pretty good name for the new function. You don't really want to have to call it something like 'unEscapeAndDecodeString'.

Another option might be to add a WARNING pragma for a few releases after you update network. It is not as noticeable as the function disappearing, but is more likely to be seen than some changelog entry on github?

Anyway. In theory, releasing a network 2.4.1.3 with the old behavior and a 2.5 with the new behavior would minimize the damages, provided that other people are actually following the PVP and that they know they need to update their code when they switch to 2.5. If they don't know they need to update their code.. then releasing 2.4.1.3/2.5 would just delay the problem slightly.

So, it's up to you I guess. I'll have new updated packages out soon enough that it won't matter to me anymore. A network 2.5 with unEscapeText would be nifty though ;) But, at the same time, adding the text dependency is not thrilling.

hesselink · 2013-03-02T08:18:36Z

While it probably would have been good to use a major version bump for changes like this, I'd prefer it to leave things as they are now. I've already had to update our packages twice. Reverting and doing a new major release requires extra code, tricky cabal version constraints and more work again for us. I'd also like to note that when I upgraded to network 2.4, I overlooked the initial change and broke our application (for a very short while). So even a major bump doesn't always help. I did mean that we discovered the cause quickly, since there was an explicit action on our part that caused it: relaxing the dependency on network.

For the future, I'd prefer:

A major release for changes like these. They are API changing, even if the types don't tell you that.
A changelog listing the changes. I often look at the diffs between version, but there is so much noise and unfamiliarity with the code that it is hard to find out what has changed.

tibbe · 2013-03-03T22:51:54Z

I will leave things as they are for now and try to make sure we bump the major version number next time.

singpolyma mentioned this issue Feb 6, 2013

TDD'd #86 #87

Merged

tibbe closed this as completed in f2168b1 Feb 6, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unescaping URI doesn't recreate unicode characters #86

Unescaping URI doesn't recreate unicode characters #86

hesselink commented Feb 6, 2013

tibbe commented Feb 6, 2013

stepcut commented Mar 1, 2013

tibbe commented Mar 1, 2013

stepcut commented Mar 2, 2013

hesselink commented Mar 2, 2013

tibbe commented Mar 3, 2013

Unescaping URI doesn't recreate unicode characters #86

Unescaping URI doesn't recreate unicode characters #86

Comments

hesselink commented Feb 6, 2013

tibbe commented Feb 6, 2013

stepcut commented Mar 1, 2013

tibbe commented Mar 1, 2013

stepcut commented Mar 2, 2013

hesselink commented Mar 2, 2013

tibbe commented Mar 3, 2013