Add support for more Unicode characters
When attempting to parse an iCalendar ICS export from Google, I discovered a few Unicode
code points in event `DESCRIPTION`s that were breaking the parser.

The three that I directly observed breaking the parser:
- U+0009 Horizontal Tab - https://unicode.org/cldr/utility/character.jsp?a=0009
- U+200B Zero Width Space - https://unicode.org/cldr/utility/character.jsp?a=200B
- U+00AD Soft Hyphen - https://unicode.org/cldr/utility/character.jsp?a=00ad

Based on my understanding, according to RFC5545[0], these characters should be supported.
However, `unicode.IsGraphic` does not return true for them.
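
As a quick check of that claim, the sketch below prints how Go's standard `unicode` package classifies the three code points. The helper name `classify` is made up for this illustration and is not part of the parser.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper (not from the parser) that reports
// whether the standard library treats r as graphic and/or as a control.
func classify(r rune) (graphic, control bool) {
	return unicode.IsGraphic(r), unicode.IsControl(r)
}

func main() {
	// Horizontal tab, zero width space, soft hyphen.
	for _, r := range []rune{'\t', '\u200B', '\u00AD'} {
		g, c := classify(r)
		fmt.Printf("%U IsGraphic=%t IsControl=%t\n", r, g, c)
	}
}
```

None of the three is graphic, and only the tab is a control character, so a lexer that accepts only `unicode.IsGraphic` runes stops at each of them.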

This may be a hacky way to support them, but I think supporting all non-control Unicode characters is
closer to the spec without being too verbose/complex.

---

Rationale for support:

contentline   = name *(";" param ) ":" value CRLF
value         = *VALUE-CHAR
VALUE-CHAR    = WSP / %x21-7E / NON-US-ASCII
NON-US-ASCII  = UTF8-2 / UTF8-3 / UTF8-4

- U+0009 is HTAB, which RFC5234 includes in WSP https://tools.ietf.org/html/rfc5234
- U+00AD and U+200B fall under NON-US-ASCII: in UTF-8 they encode as a UTF8-2 and a UTF8-3 sequence respectively, as defined in RFC3629 https://tools.ietf.org/html/rfc3629
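
To make the mapping above concrete, here is a minimal sketch that assigns a rune to one of the `VALUE-CHAR` alternatives by looking at its UTF-8 encoded length. The function `abnfClass` is a hypothetical name used only for this illustration. It follows the grammar as quoted above and is not code from the parser.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// abnfClass is a hypothetical helper that names the VALUE-CHAR
// alternative (as quoted from RFC5545) that a rune falls under.
func abnfClass(r rune) string {
	switch {
	case r == ' ' || r == '\t':
		return "WSP" // SP / HTAB per RFC5234
	case r >= 0x21 && r <= 0x7E:
		return "%x21-7E"
	}
	// utf8.RuneLen returns the encoded byte length, or -1 for
	// runes that cannot be encoded (surrogates, out of range).
	switch utf8.RuneLen(r) {
	case 2:
		return "UTF8-2"
	case 3:
		return "UTF8-3"
	case 4:
		return "UTF8-4"
	}
	return "not a VALUE-CHAR" // remaining ASCII controls, DEL, invalid runes
}

func main() {
	for _, r := range []rune{'\t', '\u00AD', '\u200B'} {
		fmt.Printf("%U -> %s\n", r, abnfClass(r))
	}
}
```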
minglecm committed Aug 15, 2018
1 parent cad1fe8 commit e48195c
Showing 2 changed files with 38 additions and 1 deletion.
33 changes: 33 additions & 0 deletions fixtures/example.ics
@@ -12,4 +12,37 @@ CATEGORIES:CONFERENCE
SUMMARY:Networld+Interop Conference
DESCRIPTION:Networld+Interop Conference and Exhibit\nAtlanta World Congress Center\n Atlanta\, Georgia
END:VEVENT
BEGIN:VEVENT
DTSTAMP:19960704T120000Z
UID:uid1@example.com
ORGANIZER:mailto:jsmith@example.com
DTSTART:19960918T143000Z
DTEND:19960920T220000Z
STATUS:CONFIRMED
CATEGORIES:CONFERENCE
SUMMARY:Contains a tab
DESCRIPTION:between the carets > < is a horizontal tab
END:VEVENT
BEGIN:VEVENT
DTSTAMP:19960704T120000Z
UID:uid1@example.com
ORGANIZER:mailto:jsmith@example.com
DTSTART:19960918T143000Z
DTEND:19960920T220000Z
STATUS:CONFIRMED
CATEGORIES:CONFERENCE
SUMMARY:Contains a zero width space
DESCRIPTION:between the carets >​< is a zero with space
END:VEVENT
BEGIN:VEVENT
DTSTAMP:19960704T120000Z
UID:uid1@example.com
ORGANIZER:mailto:jsmith@example.com
DTSTART:19960918T143000Z
DTEND:19960920T220000Z
STATUS:CONFIRMED
CATEGORIES:CONFERENCE
SUMMARY:Contains a soft hyphen
DESCRIPTION:between the carets >­< is a soft hyphen
END:VEVENT
END:VCALENDAR
6 changes: 5 additions & 1 deletion lex.go
@@ -346,7 +346,7 @@ func lexValue(l *lexer) stateFn {
Loop:
for {
switch r := l.next(); {
-case unicode.IsGraphic(r):
+case isValueChar(r):
// absorb
default:
l.backup()
@@ -372,6 +372,10 @@ func isSafeChar(r rune) bool {
return !unicode.IsControl(r) && r != '"' && r != ';' && r != ':' && r != ','
}

func isValueChar(r rune) bool {
return r == '\t' || (!unicode.IsControl(r) && utf8.ValidRune(r))
}

// item helpers

// isItemName checks if the item is an ical name
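
For reference, a standalone sketch of how the new `isValueChar` predicate behaves on a few sample runes. The function body is copied from the diff above. The sample table and the `-1` end-of-input value are assumptions made for this illustration, not taken from the lexer.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isValueChar is copied from the diff above: accept tab explicitly,
// otherwise accept any valid rune that is not a control character.
func isValueChar(r rune) bool {
	return r == '\t' || (!unicode.IsControl(r) && utf8.ValidRune(r))
}

func main() {
	samples := []struct {
		name string
		r    rune
	}{
		{"horizontal tab", '\t'},
		{"zero width space", '\u200B'},
		{"soft hyphen", '\u00AD'},
		{"carriage return", '\r'},
		{"line feed", '\n'},
		{"surrogate half", 0xD800},
		{"assumed EOF sentinel", -1}, // hypothetical value, not from the lexer
	}
	for _, s := range samples {
		fmt.Printf("%-22s accepted=%t\n", s.name, isValueChar(s.r))
	}
}
```

The explicit `r == '\t'` test is needed because `unicode.IsControl` reports true for the tab. CR and LF are still rejected, so the lexer stops at the end of a content line. One difference from the grammar quoted in the commit message: the C1 controls U+0080 to U+009F are valid UTF8-2 sequences but are rejected here, because `unicode.IsControl` reports true for them.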