helpers.foldline: do not insert spaces between the high and low UTF-16 surrogates #439
I was wrong in the comment above; here is the corrected description. ical.js used to fold in the middle of a UTF-16 surrogate pair. To be precise, it inserted a new line and a space after the high UTF-16 surrogate and before the low UTF-16 surrogate. This created invalid stand-alone surrogates, each of which is replaced by U+FFFD REPLACEMENT CHARACTER when the string is encoded as UTF-8. The problem can be reproduced with Node.js:
import {writeFileSync} from 'fs';
const a = '\uD83D\uDCAA' // this is 💪 'Flexed Biceps'; UTF-8 encoding: 0xF0 0x9F 0x92 0xAA; HTML entity: &#128170; or &#x1F4AA;; UTF-16 encoding: 0xD83D 0xDCAA
writeFileSync('a1', a, 'utf8') // file contains the bytes f0 9f 92 aa
writeFileSync('a2', a.charAt(0) + a.charAt(1), 'utf8') // the same as above
writeFileSync('b', a.charAt(0) + ' ' + a.charAt(1), 'utf8') // file contains the bytes 0xEF 0xBF 0xBD 0x20 0xEF 0xBF 0xBD, i.e. REPLACEMENT CHARACTER (�), space, REPLACEMENT CHARACTER
So the result in b is valid UTF-8, but it has nothing in common with the original text, and nothing in b suggests that any form of reconstructing a from b should be attempted. This patch takes a full UTF-16 character from the input,
Added two test cases to test/stringify_test.js:test('folding',… , that fail with the current code:
//ICAL.foldLength = 15;
// '💪' is '\uD83D\uDCAA' in UTF-16 and F0 9F 92 AA in UTF-8. If a new line and a space are inserted between the surrogates, each stand-alone surrogate is replaced by REPLACEMENT CHARACTER (0xEF 0xBF 0xBD) when the string is encoded as UTF-8
subject.setValue('💪');
assert.equal(ICAL.stringify.property(subject.toJSON(), ICAL.design.icalendar, false), "DESCRIPTION:" + N + '💪'); // verify the new line comes right after ':', as otherwise the whole line is longer than ICAL.foldLength
subject.setValue('aa💪💪a💪💪');
assert.equal(ICAL.stringify.property(subject.toJSON(), ICAL.design.icalendar, false), "DESCRIPTION:aa" + N + '💪💪a💪' + N + '💪'); // verify that 💪 is moved as a whole to a new line, as it takes 4 UTF-8 bytes
Internet Explorer does not have String.prototype.codePointAt, which is used here. The code can be rewritten without this method, but it is unclear which JavaScript engines ical.js is supposed to support.
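For reference, a minimal sketch of how the code point could be read without String.prototype.codePointAt. The helper name codePointAtCompat is hypothetical and not part of ical.js or of this patch; it only relies on String.prototype.charCodeAt.

```javascript
// Hypothetical helper (not part of ical.js): returns the code point that
// starts at index `pos` of `str`, using only charCodeAt.
function codePointAtCompat(str, pos) {
  var high = str.charCodeAt(pos);
  // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
  // (0xDC00-0xDFFF) encodes one code point above U+FFFF.
  if (high >= 0xD800 && high <= 0xDBFF && pos + 1 < str.length) {
    var low = str.charCodeAt(pos + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
  }
  // BMP character or unpaired surrogate: return the code unit itself.
  return high;
}
```

For '\uD83D\uDCAA' at position 0 this yields 0x1F4AA, the code point of 💪.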
…6 surrogates
ical.js used to fold in the middle of a UTF-16 surrogate pair. To be
precise, it inserted a new line and a space after the high UTF-16 surrogate
and before the low UTF-16 surrogate. This created invalid stand-alone
surrogates, each of which is replaced by U+FFFD REPLACEMENT CHARACTER when
the string is encoded as UTF-8.
The problem can be reproduced with Node.js:
import {writeFileSync} from 'fs';
const a = '\uD83D\uDCAA' // this is 💪 'Flexed Biceps', UTF-8 encoding: 0xF0 0x9F
// 0x92 0xAA, HTML entity: &#128170; or &#x1F4AA;, UTF-16 encoding: 0xD83D 0xDCAA
writeFileSync('a1', a, 'utf8') // file contains the bytes f0 9f 92 aa
writeFileSync('a2', a.charAt(0) + a.charAt(1), 'utf8') // the same as above
writeFileSync('b', a.charAt(0) + ' ' + a.charAt(1), 'utf8') // file contains the bytes
// 0xEF 0xBF 0xBD 0x20 0xEF 0xBF 0xBD, where 0xEF 0xBF 0xBD = REPLACEMENT CHARACTER = �
So the result in b is valid UTF-8, but it has nothing in common with the
original text, and nothing in b suggests that any form of reconstructing a
from b should be attempted.
This patch takes a full code point from the input,
• calculates whether it takes one or two UTF-16 code units, and keeps in pos
where the next full code point starts,
• calculates for the full code point whether it needs 1, 2, 3 or 4 bytes to
be represented in UTF-8, keeping in line_length the number of UTF-8 bytes
needed so far,
• splits the line when the UTF-8 representation exceeds ICAL.foldLength bytes
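
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of the patch: the names utf8Length and foldLineSketch are hypothetical, the fold length and line break are passed in as parameters instead of being read from ICAL.foldLength, and the leading space of a continuation line is assumed to count towards the line length.

```javascript
// Hypothetical sketch, not the ical.js implementation.

// Number of bytes a code point occupies in UTF-8.
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Folds `line` so that no output line exceeds `foldLength` UTF-8 bytes,
// never splitting a surrogate pair. `newLine` is the line break to insert.
function foldLineSketch(line, foldLength, newLine) {
  var result = '';
  var lineLength = 0; // UTF-8 bytes on the current output line
  var pos = 0;        // index of the next full code point in `line`
  while (pos < line.length) {
    var codePoint = line.codePointAt(pos);
    var units = codePoint > 0xFFFF ? 2 : 1; // UTF-16 code units consumed
    var bytes = utf8Length(codePoint);
    if (lineLength + bytes > foldLength) {
      result += newLine + ' ';
      lineLength = 1; // assumption: the leading space counts as one byte
    }
    result += line.slice(pos, pos + units);
    lineLength += bytes;
    pos += units;
  }
  return result;
}

console.log(JSON.stringify(foldLineSketch('DESCRIPTION:aa💪💪a💪💪', 15, '\r\n')));
```

With a fold length of 15 and '\r\n' as the line break, the sketch folds before the first 💪 and before the last 💪, so every 💪 stays intact on one line.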
When will a new version be released?
https://tools.ietf.org/html/rfc5545#section-3.1 says:
Note: It is possible for very simple implementations to generate
improperly folded lines in the middle of a UTF-8 multi-octet
sequence. For this reason, implementations need to unfold lines
in such a way to properly restore the original sequence.
ical.js used to fold in the middle of a UTF-8 multi-octet sequence.
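
One way to follow the RFC note on the reading side is to unfold on the raw bytes before decoding them as UTF-8, so that a multi-octet sequence split by a fold is joined again. The sketch below is an assumption about how that could look; unfoldBytes is a hypothetical helper, not an ical.js API. It does not help with the surrogate case described above, where the stand-alone surrogates have already been written as REPLACEMENT CHARACTER bytes.

```javascript
// Hypothetical helper (not part of ical.js): removes every fold
// (CR LF followed by SPACE or HTAB) from raw bytes, then decodes as UTF-8.
function unfoldBytes(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var isFold = bytes[i] === 0x0D && bytes[i + 1] === 0x0A &&
      (bytes[i + 2] === 0x20 || bytes[i + 2] === 0x09);
    if (isFold) {
      i += 2; // skip CR, LF and the whitespace (the loop adds the third step)
      continue;
    }
    out.push(bytes[i]);
  }
  return new TextDecoder('utf-8').decode(new Uint8Array(out));
}

// 'A' followed by the bytes of 💪 (F0 9F 92 AA), folded after the second byte.
var folded = Uint8Array.from([0x41, 0xF0, 0x9F, 0x0D, 0x0A, 0x20, 0x92, 0xAA]);
console.log(unfoldBytes(folded) === 'A\uD83D\uDCAA');
```

Decoding each physical line separately would instead turn both halves of the split sequence into replacement characters, which is why the unfolding happens before the decoding here.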