@dilyanpalauzov
Contributor

https://tools.ietf.org/html/rfc5545#section-3.1 says:

Note: It is possible for very simple implementations to generate
improperly folded lines in the middle of a UTF-8 multi-octet
sequence. For this reason, implementations need to unfold lines
in such a way to properly restore the original sequence.

ical.js used to fold in the middle of a UTF-8 multi-octet sequence.

@dilyanpalauzov dilyanpalauzov changed the title helpers.foldline: do not fold in the middle of UTF-8 characters helpers.foldline: do not isert spaces between the high and low UTF-16 surrogates May 10, 2020
@dilyanpalauzov dilyanpalauzov force-pushed the utf8_folding branch 2 times, most recently from 9206dbf to 9f3e0c4 Compare May 10, 2020 15:39
@dilyanpalauzov
Contributor Author

dilyanpalauzov commented May 10, 2020

I was wrong with the above comment, here the updated one:

ical.js used to fold in the middle of a UTF-16 surrogate pair. To be precise, it inserted a new line and a space after the high UTF-16 surrogate and before the low UTF-16 surrogate. This created invalid stand-alone surrogates, which the JavaScript engine converts to something valid.

This is the problem, with Node.js:

import {writeFileSync} from 'fs';

const a = '\uD83D\uDCAA'  //this is 💪 'Flexed Biceps', UTF-8 encoding: 0xF0 0x9F 0x92 0xAA, HTML entities: &#128170; and &#x1F4AA;, UTF-16 encoding: 0xD83D 0xDCAA
writeFileSync('a1', a, 'utf8')  //file contains the bytes f0 9f 92 aa
writeFileSync('a2', a.charAt(0) + a.charAt(1), 'utf8') //file contains the bytes f0 9f 92 aa, the same as above
writeFileSync('b', a.charAt(0) + ' ' + a.charAt(1), 'utf8') //file contains the bytes 0xEF 0xBF 0xBD 0x20 0xEF 0xBF 0xBD, where 0xEF 0xBF 0xBD = REPLACEMENT CHARACTER = �

So the result in b is valid UTF-8, but it has nothing in common with the original text, and nothing in it suggests that any form of reconstructing a from b should be attempted.

This patch takes one full code point at a time from the input, and
• calculates whether it occupies one or two UTF-16 code units, keeping in pos the index where the next full code point starts,
• calculates whether the code point needs 1, 2, 3 or 4 bytes to be represented in UTF-8, keeping in line_length the number of UTF-8 bytes needed so far for the current line,
• splits the line when the UTF-8 representation would exceed ICAL.foldLength bytes.
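
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names utf8Length and foldLine, the default fold length of 75, and the choice to count the leading space of a continuation line toward its length are all assumptions made for the example.

```javascript
// Hypothetical helper (not from ical.js): number of bytes needed to
// encode a single Unicode code point in UTF-8.
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Hypothetical sketch of folding by UTF-8 byte length. A fold is only
// ever inserted between two full code points, so a surrogate pair is
// never split. Assumes foldLength is at least 5.
function foldLine(line, foldLength = 75, newline = '\r\n') {
  let result = '';
  let lineLength = 0; // UTF-8 bytes on the current physical line
  let pos = 0;        // index of the next UTF-16 code unit to read

  while (pos < line.length) {
    const codePoint = line.codePointAt(pos);
    const units = codePoint > 0xFFFF ? 2 : 1; // UTF-16 code units used
    const bytes = utf8Length(codePoint);

    if (lineLength + bytes > foldLength) {
      result += newline + ' ';
      lineLength = 1; // assumption: the leading space counts as one byte
    }

    result += line.slice(pos, pos + units);
    lineLength += bytes;
    pos += units;
  }
  return result;
}
```

With a fold length of 15, foldLine('DESCRIPTION:💪', 15) places the whole 💪 on a continuation line, because the 12 bytes of 'DESCRIPTION:' plus the 4 bytes of 💪 would exceed the limit.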

@dilyanpalauzov dilyanpalauzov force-pushed the utf8_folding branch 5 times, most recently from a100ac3 to c097c7d Compare May 10, 2020 16:04
@dilyanpalauzov dilyanpalauzov changed the title helpers.foldline: do not isert spaces between the high and low UTF-16 surrogates helpers.foldline: do not insert spaces between the high and low UTF-16 surrogates May 10, 2020
@dilyanpalauzov
Contributor Author

Added two test cases to test/stringify_test.js:test('folding',… , that fail with the current code:

//ICAL.foldLength = 15; 
//'💪' is '\uD83D\uDCAA' in UTF-16, but in UTF-8 it is F0 9F 92 AA. If a space/new line is inserted between the surrogates, the JS engine substitutes each stand-alone surrogate with REPLACEMENT CHARACTER 0xEF 0xBF 0xBD
subject.setValue('💪');
assert.equal(ICAL.stringify.property(subject.toJSON(), ICAL.design.icalendar, false), "DESCRIPTION:" + N + '💪');//verify new line is after ':', as otherwise the whole line is longer than ICAL.foldLength
subject.setValue('aa💪💪a💪💪');
assert.equal(ICAL.stringify.property(subject.toJSON(), ICAL.design.icalendar, false), "DESCRIPTION:aa" + N + '💪💪a💪' + N + '💪');//verify that 💪 is moved as whole to a new line as it is 4 UTF-8 bytes

@dilyanpalauzov
Contributor Author

dilyanpalauzov commented May 11, 2020

Internet Explorer does not have String.prototype.codePointAt, which is used here. The code can be rewritten without this method, but it is unclear which JavaScript engines ical.js is supposed to run on.
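
One way such a rewrite could look is sketched below, using only String.prototype.charCodeAt. The function name codePointAtCompat is hypothetical and is not part of ical.js or of this patch.

```javascript
// Hypothetical replacement for String.prototype.codePointAt that relies
// only on charCodeAt, which older engines also provide.
function codePointAtCompat(str, pos) {
  var high = str.charCodeAt(pos);
  // A high surrogate followed by a low surrogate forms one code point.
  if (high >= 0xD800 && high <= 0xDBFF && pos + 1 < str.length) {
    var low = str.charCodeAt(pos + 1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
  }
  // BMP character or stand-alone surrogate: return the code unit itself.
  return high;
}
```

For the example string '\uD83D\uDCAA', codePointAtCompat(str, 0) yields 0x1F4AA, while a stand-alone surrogate is returned unchanged as its own code unit.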

@kewisch kewisch added the needinfo More information has been requested label Jul 13, 2021
…6 surrogates

ical.js used to fold in the middle of a UTF-16 code point sequence.  To be
precise, it inserted a space and a new line after the high UTF-16 surrogate
and before the low UTF-16 surrogate.  This created invalid stand-alone
surrogates which the JavaScript engine converts to something valid.

This is the problem, with Node.js:

import {writeFileSync} from 'fs';
const a = '\uD83D\uDCAA'  //this is 💪 'Flexed Biceps', UTF-8 encoding: 0xF0 0x9F
// 0x92 0xAA, HTML entities: &#128170; and &#x1F4AA;, UTF-16 encoding: 0xD83D 0xDCAA
writeFileSync('a1', a, 'utf8')  //file contains the bytes f0 9f 92 aa
writeFileSync('a2', a.charAt(0) + a.charAt(1), 'utf8') //the same as above
writeFileSync('b', a.charAt(0) + ' ' + a.charAt(1), 'utf8') //file contains the bytes
// 0xEF 0xBF 0xBD 0x20 0xEF 0xBF 0xBD, where 0xEF 0xBF 0xBD = REPLACEMENT CHARACTER = �

So the result in b is valid UTF-8, but it has nothing in common with the
original text, and nothing in it suggests that any form of reconstructing
a from b should be attempted.

This patch takes one full code point at a time from the input, and
• calculates whether it occupies one or two UTF-16 code units, keeping in
pos the index where the next full code point starts,
• calculates whether the code point needs 1, 2, 3 or 4 bytes to be
represented in UTF-8, keeping in line_length the number of UTF-8 bytes
needed so far for the current line,
• splits the line when the UTF-8 representation would exceed ICAL.foldLength bytes.
@kewisch kewisch merged commit 44e9a2c into kewisch:master Aug 1, 2021
@dilyanpalauzov dilyanpalauzov deleted the utf8_folding branch August 1, 2021 21:55
@dilyanpalauzov
Contributor Author

When will a new version be released?

@kewisch kewisch removed the needinfo More information has been requested label Dec 5, 2021