helpers.foldline: do not insert spaces between the high and low UTF-16 surrogates #439

dilyanpalauzov · 2020-04-25T20:34:42Z

https://tools.ietf.org/html/rfc5545#section-3.1 says:

Note: It is possible for very simple implementations to generate
improperly folded lines in the middle of a UTF-8 multi-octet
sequence. For this reason, implementations need to unfold lines
in such a way to properly restore the original sequence.

ical.js used to fold in the middle of a UTF-8 multi-octet sequence.
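
The unfolding requirement in the RFC note can be illustrated with a minimal sketch, assuming Node.js and raw bytes as input. The helper name `unfoldBytes` is hypothetical and not part of ical.js. It removes each fold (CRLF followed by one space or tab) at the byte level before decoding, so a multi-octet sequence that a naive folder split is joined again first.

```javascript
// Hypothetical helper, not part of ical.js: unfold on bytes, then decode.
function unfoldBytes(buf) {
  const out = [];
  for (let i = 0; i < buf.length; i++) {
    // A fold is CRLF followed by one SPACE or HTAB (RFC 5545, section 3.1).
    if (buf[i] === 0x0d && buf[i + 1] === 0x0a &&
        (buf[i + 2] === 0x20 || buf[i + 2] === 0x09)) {
      i += 2; // together with the loop increment this skips CR, LF and the whitespace octet
      continue;
    }
    out.push(buf[i]);
  }
  return Buffer.from(out).toString('utf8');
}

// 'A', then 💪 (F0 9F 92 AA) folded after its second octet, then 'B'.
const foldedBytes = Buffer.from([0x41, 0xf0, 0x9f, 0x0d, 0x0a, 0x20, 0x92, 0xaa, 0x42]);
const unfoldedText = unfoldBytes(foldedBytes);
```

Decoding first and unfolding afterwards would not work here, because the decoder would already have replaced the two broken halves.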

dilyanpalauzov · 2020-05-10T15:42:18Z

I was wrong in the comment above; here is the corrected one:

ical.js used to fold in the middle of a UTF-16 surrogate pair. To be precise, it inserted a new line and a space after the high UTF-16 surrogate and before the low UTF-16 surrogate. This created invalid stand-alone surrogates, which the JavaScript engine converts to something valid when the string is written out.

The following Node.js snippet demonstrates the problem:

import {writeFileSync} from 'fs';

// '\uD83D\uDCAA' is 💪 'Flexed Biceps' (U+1F4AA).
// UTF-8 encoding: 0xF0 0x9F 0x92 0xAA; UTF-16 encoding: 0xD83D 0xDCAA; HTML entity: &#128170; or &#x1F4AA;
const a = '\uD83D\uDCAA';
writeFileSync('a1', a, 'utf8'); // file contains the bytes f0 9f 92 aa
writeFileSync('a2', a.charAt(0) + a.charAt(1), 'utf8'); // the same bytes as a1
// b contains the bytes 0xEF 0xBF 0xBD 0x20 0xEF 0xBF 0xBD; each 0xEF 0xBF 0xBD is U+FFFD REPLACEMENT CHARACTER (�)
writeFileSync('b', a.charAt(0) + ' ' + a.charAt(1), 'utf8');

So the result in b is valid UTF-8, but it has nothing in common with the original text, and nothing in b suggests that any form of reconstructing a from b should be attempted.

This patch takes one full code point at a time from the input and
• determines whether it occupies one or two UTF-16 code units, keeping in pos the index where the next full code point starts,
• determines whether the code point needs 1, 2, 3 or 4 bytes to be represented in UTF-8, keeping in line_length the number of UTF-8 bytes needed so far,
• splits the line when the UTF-8 representation exceeds ICAL.foldLength bytes.
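
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged ical.js code: the names `utf8Length` and `foldByCodePoint` are hypothetical, the fold length and line break are passed as parameters instead of being read from ICAL, and the sketch counts the leading space of a continuation line toward the byte limit.

```javascript
// Hypothetical sketch, not the merged ical.js implementation.

// Number of UTF-8 bytes needed to encode one code point.
function utf8Length(cp) {
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

function foldByCodePoint(line, foldLength = 75, newLine = '\r\n') {
  let result = '';
  let lineLength = 0; // UTF-8 bytes on the current output line
  let lineStart = 0;  // UTF-16 index where the current output line starts
  let pos = 0;        // UTF-16 index of the next full code point
  while (pos < line.length) {
    const cp = line.codePointAt(pos);
    const units = cp > 0xffff ? 2 : 1; // a surrogate pair takes two code units
    const bytes = utf8Length(cp);
    // Fold before this code point if it would not fit; never emit an empty line.
    if (lineLength + bytes > foldLength && pos > lineStart) {
      result += line.slice(lineStart, pos) + newLine + ' ';
      lineStart = pos;
      lineLength = 1; // assumption: the leading space counts toward the limit
    }
    lineLength += bytes;
    pos += units;
  }
  return result + line.slice(lineStart);
}

// Usage with a 15-byte limit: the emoji always moves to the next line as a whole.
const foldedShort = foldByCodePoint('DESCRIPTION:💪', 15);
const foldedLong = foldByCodePoint('DESCRIPTION:aa💪💪a💪💪', 15);
```

Because pos only ever advances by whole code points, a fold can never land between a high and a low surrogate.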

dilyanpalauzov · 2020-05-10T19:11:13Z

Added two test cases to test/stringify_test.js:test('folding',… , that fail with the current code:

// ICAL.foldLength = 15;
// '💪' is '\uD83D\uDCAA' in UTF-16, but in UTF-8 it is F0 9F 92 AA. If a space/new line is inserted between the surrogates, the JS engine substitutes each stand-alone surrogate with REPLACEMENT CHARACTER 0xEF 0xBF 0xBD.
subject.setValue('💪');
// Verify that the new line comes right after ':', as otherwise the whole line is longer than ICAL.foldLength.
assert.equal(ICAL.stringify.property(subject.toJSON(), ICAL.design.icalendar, false), "DESCRIPTION:" + N + '💪');
subject.setValue('aa💪💪a💪💪');
// Verify that 💪 is moved as a whole to a new line, as it takes 4 UTF-8 bytes.
assert.equal(ICAL.stringify.property(subject.toJSON(), ICAL.design.icalendar, false), "DESCRIPTION:aa" + N + '💪💪a💪' + N + '💪');

dilyanpalauzov · 2020-05-11T01:34:55Z

Internet Explorer does not have String.prototype.codePointAt, which is used here. The code can be rewritten without this method, but it is unclear which JavaScript engines ical.js is supposed to run on.
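
For illustration, a rewrite without this method could look roughly like the following sketch. The function name `codePointAtCompat` is hypothetical and not part of ical.js; it relies only on String.prototype.charCodeAt and combines a valid surrogate pair arithmetically.

```javascript
// Hypothetical stand-in for String.prototype.codePointAt, using only charCodeAt.
function codePointAtCompat(str, pos) {
  var hi = str.charCodeAt(pos);
  // A high surrogate followed by a low surrogate encodes one code point above U+FFFF.
  if (hi >= 0xd800 && hi <= 0xdbff && pos + 1 < str.length) {
    var lo = str.charCodeAt(pos + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
    }
  }
  // BMP character or stand-alone surrogate: return the code unit itself.
  return hi;
}

var flexedBiceps = codePointAtCompat('\uD83D\uDCAA', 0); // 0x1F4AA
```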


dilyanpalauzov · 2021-08-03T09:57:34Z

When will a new version be released?

dilyanpalauzov force-pushed the utf8_folding branch from caf51dd to 5028f4c Compare May 10, 2020 15:33

dilyanpalauzov changed the title ~~helpers.foldline: do not fold in the middle of UTF-8 characters~~ helpers.foldline: do not isert spaces between the high and low UTF-16 surrogates May 10, 2020

dilyanpalauzov force-pushed the utf8_folding branch 2 times, most recently from 9206dbf to 9f3e0c4 Compare May 10, 2020 15:39

dilyanpalauzov force-pushed the utf8_folding branch 5 times, most recently from a100ac3 to c097c7d Compare May 10, 2020 16:04

dilyanpalauzov changed the title ~~helpers.foldline: do not isert spaces between the high and low UTF-16 surrogates~~ helpers.foldline: do not insert spaces between the high and low UTF-16 surrogates May 10, 2020

dilyanpalauzov force-pushed the utf8_folding branch from c097c7d to 3320823 Compare May 10, 2020 19:03

dilyanpalauzov force-pushed the utf8_folding branch from 3320823 to 0ad2dea Compare May 11, 2020 11:09

dilyanpalauzov mentioned this pull request Oct 13, 2020

VTIMEZONE / Timezone usage/handling #455

Closed

kewisch requested changes Jul 13, 2021

lib/ical/helpers.js (review thread resolved)

kewisch added the needinfo ("More information has been requested") label Jul 13, 2021

dilyanpalauzov force-pushed the utf8_folding branch from 0ad2dea to 78c7f3f Compare July 13, 2021 11:50

kewisch merged commit 44e9a2c into kewisch:master Aug 1, 2021

dilyanpalauzov deleted the utf8_folding branch August 1, 2021 21:55

dilyanpalauzov mentioned this pull request Nov 6, 2021

Please realease a new version #478

Closed

kewisch removed the needinfo ("More information has been requested") label Dec 5, 2021
