
Serialize consistent cross-platform line endings across plist, glif, and feature files #172

Merged: 7 commits merged into linebender:master from the line-endings branch on Sep 9, 2021

Conversation

@chrissimpkins (Collaborator) commented Aug 26, 2021

Supersedes #170

Based on discussion in #162, we will write line feeds in all UFO files on all platforms by default.

This branch is based on #171 (edit: #171 now merged into master branch)


TODO

Confirm that line feeds are serialized by default in plist, glif, and feature files

@chrissimpkins (Collaborator, Author) commented Aug 26, 2021

I'm not sure what was going on with the previous tests. Either the expected test file strings had incorrectly formatted line endings, or fs::read was converting line endings to platform defaults. In any case, in the new tests here, plist and glif files serialize with \n line endings by default across platforms, so no change was required to make that the default line ending approach. This PR adds carriage return removal normalization for feature files on serialization, which appears to be the only change required to get to \n line ending serialization across all UFO file types on all platforms. glif, plist, and fea tests are added, including a new test UFO dir that has all files formatted with \r\n line endings, to confirm that they are changed during a read/write round trip.

Ready for review

cc @khaledhosny

src/font.rs Outdated
// Normalize feature files with line feed line endings
// This is consistent with the line endings serialized in glif and plist files
if features.contains('\r') {
fs::write(path.join(FEATURES_FILE), features.replace("\r", ""))?;

(Reviewer):
I suppose no one uses old mac line endings anymore, but in case someone does this would break the file, no?

@chrissimpkins (Collaborator, Author):

I had the same thought last night and had the change ready to push but held off based on your ‘it is 2021’ mantra. :)

Can push it if you think that anything out there could possibly write carriage returns only.

(Reviewer):

The last time I saw \r used as a line ending character was in VOLT (which is a Windows tool written by MS, go figure), so it is not relevant here. But even though it is 2021, Postel's Law might still apply to this case.

@chrissimpkins (Collaborator, Author):

Sgtm 234287e

(Collaborator):

I think VTT table dumps (in fontTools' TTX format) contain \r, but as the two literal characters, not \u000D. I think the data folder shouldn't be touched anyway.

(Collaborator):

Oh wait. It actually contains a literal \u000D, see https://github.com/daltonmaag/ubuntu/blob/master/source/Ubuntu-B.ufo/data/com.github.fonttools.ttx/T_S_I__1.ttx. Maybe because VTT started life on very old Macs that still used \r?

I suppose mixing LF and CRLF in any part of the UFO (glyph lib or elsewhere) is a bad idea so I guess this PR is fine, though.

@chrissimpkins (Collaborator, Author):

The data will be read in as byte arrays?

Maybe because VTT started life on very old Macs that still used \r?

I did not know this!

@chrissimpkins (Collaborator, Author) commented Aug 26, 2021

I benchmarked the String::contains check because I was concerned that, in the worst case, there would be two full document char traversals to achieve the replacement. I think it is worth keeping: this will run over every fea file at serialization, and the most common case is a file with no carriage returns, where the code runs significantly faster with the contains check than without it.

Here are the data, where contains = includes the contains check and no contains = running the string replacement over the string without the check:

File without carriage returns (most common case)

test tests::benchmark_contains_without_cr           ... bench:         261 ns/iter (+/- 20)
test tests::benchmark_no_contains_without_cr        ... bench:       1,458 ns/iter (+/- 104)

File with carriage return + line feed on the last line only (worst case, but highly unlikely)

test tests::benchmark_contains_with_cr_last_line    ... bench:       1,765 ns/iter (+/- 117)
test tests::benchmark_no_contains_with_cr_last_line ... bench:       1,635 ns/iter (+/- 124)

File with carriage return + line feed on all lines (most likely case when carriage returns are used)

test tests::benchmark_contains_with_cr              ... bench:       2,955 ns/iter (+/- 212)
test tests::benchmark_no_contains_with_cr           ... bench:       2,962 ns/iter (+/- 241)

Notably, the performance improvement is contingent on using a char in the contains check. If you convert it to a string slice, there is a significant performance impact associated with the contains check in the worst case scenario (roughly two-fold slower than not using the check). The overall impact is clearly low: this is one fea String per UFO master, with differences in the microsecond range.

chrissimpkins added a commit to chrissimpkins/norad that referenced this pull request Aug 26, 2021
@chrissimpkins chrissimpkins changed the title Serialize consistent cross-platform line endings Serialize consistent cross-platform line endings across plist, glif, and feature files Aug 26, 2021
@chrissimpkins chrissimpkins linked an issue Aug 26, 2021 that may be closed by this pull request
@chrissimpkins chrissimpkins mentioned this pull request Aug 26, 2021
chrissimpkins added a commit to chrissimpkins/norad that referenced this pull request Aug 31, 2021
@chrissimpkins (Collaborator, Author):

Thoughts about this approach @cmyr @madig?

@cmyr (Member) commented Sep 2, 2021

Sorry to let this sit, my attention has been elsewhere. I'm off the computer shortly for some irl appointments but I'll try to take a look tonight or tomorrow!

@cmyr cmyr self-requested a review September 2, 2021 16:02
@cmyr (Member) left a comment:

This works for me!

src/font.rs Outdated
fs::write(path.join(FEATURES_FILE), features)?;
// Normalize feature files with line feed line endings
// This is consistent with the line endings serialized in glif and plist files
if features.contains('\r') {
@cmyr (Member):

just a thought, and maybe the compiler is smart enough to optimize this, but I would prefer to scan for the \r byte over the \r char; scanning chars means decoding utf-8, whereas scanning bytes doesn't.

@cmyr (Member):

My only other thought (and this is totally unnecessary) is that if we really wanted to be perf nerds, the better approach would probably be to do this replacement when reading, instead of when writing.

Basically:

  • instead of something like fs::read_to_string, manually open the fea as a File
  • allocate two buffers
  • read chunks from the file into the first buffer. on each read, scan the chunk for \r
  • if \r is encountered, then manually recopy the lines into the second buffer, skipping an \r
  • if \r is never encountered, just return the first buffer untouched as the result, else return the second buffer.

this is dumb and definitely not worth it. 🤷
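[Editorial note: the chunked-read idea above could be sketched as follows. This is illustrative only and slightly simplified versus the comment: it always copies into a single output buffer rather than returning the read buffer untouched, and the chunk size is arbitrary.]

```rust
use std::io::{self, Read};

/// Read all of `reader` in chunks, stripping every `\r` byte on the way in.
/// Sketch of the approach described above; not norad's actual implementation.
fn read_stripping_cr<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = [0u8; 8192]; // arbitrary chunk size
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        let chunk = &buf[..n];
        if chunk.contains(&b'\r') {
            // Recopy the chunk, skipping carriage returns.
            out.extend(chunk.iter().copied().filter(|&b| b != b'\r'));
        } else {
            out.extend_from_slice(chunk);
        }
    }
    Ok(out)
}
```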

@chrissimpkins (Collaborator, Author):

scanning chars means decoding utf-8

I was wondering about the read approach when I worked on this. It looks like the feature read is to String so the UTF-8 check happens read side.

fn load_features(features_path: &Path) -> Result<String, Error> {
    let features = fs::read_to_string(features_path)?;
    Ok(features)
}

Does the feature file data need to be a UTF-8 vetted String? On the one hand, I suppose that it provides a built-in encoding linting check at read time. On the other, it may not be necessary unless there are String'y things that will need to be done with it in the library. Right now it appears that feature data are exposed as is to users in a String without other feature support.

@chrissimpkins (Collaborator, Author):

I would prefer to scan for the \r byte over the \r char

Are you referring to something along the lines of searching for b"\r" in features.as_bytes()?

@cmyr (Member):

Exactly.

@cmyr (Member):

scanning chars means decoding utf-8

what I mean is simply that the chars() iter has to determine character boundaries in the underlying bytes, which has some overhead. This is on top of the overhead of validating the utf-8 when it's first read.

I was wondering about the read approach when I worked on this. It looks like the feature read is to String so the UTF-8 check happens read side.

Yes, I believe we read directly to string currently, which handles validation. We could alternatively read to bytes and then validate ourselves, to equal effect.

fn load_features(features_path: &Path) -> Result<String, Error> {
    let features = fs::read_to_string(features_path)?;
    Ok(features)
}

Does the feature file data need to be a UTF-8 vetted String? On the one hand, I suppose that it provides a built-in encoding linting check at read time. On the other, it may not be necessary unless there are String'y things that will need to be done with it in the library. Right now it appears that feature data are exposed as is to users in a String without other feature support.

This is a good question, which is not answered by the spec. The spec does mention that strings are in utf-8, and I would prefer to assume that the whole file is, as well. It might be worth opening an issue in the Adobe repo...
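[Editorial note: the read-to-bytes-then-validate alternative mentioned above could look roughly like this. It is a hypothetical sketch with the same effect as fs::read_to_string; the function name is illustrative and the error type is simplified to Box<dyn Error> rather than norad's own Error.]

```rust
use std::{fs, path::Path};

// Sketch: read raw bytes first, then validate UTF-8 explicitly, so the
// validation step is visible rather than hidden inside read_to_string.
fn load_features_from_bytes(path: &Path) -> Result<String, Box<dyn std::error::Error>> {
    let bytes = fs::read(path)?;
    let features = String::from_utf8(bytes)?; // fails on invalid UTF-8
    Ok(features)
}
```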

@chrissimpkins (Collaborator, Author):

converted to a byte check in 2e84bce

@chrissimpkins (Collaborator, Author):

This is a good question, which is not answered by the spec. The spec does mention that strings are in utf-8, and I would prefer to assume that the whole file is, as well. It might be worth opening an issue in the Adobe repo...

Response from Josh Hadley (Adobe):

Yes, FEA files should indeed use UTF-8 encoding throughout. This has been an unwritten (well…partially written) assumption for some time now but as of the latest AFDKO (3.7.1), that assumption is enforced by the new Antlr4-based parser in makeotfexe. Older versions of makeotfexe might have been lax on that point, but that will no longer be the case moving forward. Also worth noting that makeotf (the Python interface to makeotfexe) has effectively enforced this for a while already.

But as you note, the OpenType Feature File Specification does not actually explicitly state that the whole file should be UTF-8. We will clarify that in an update soon.

@chrissimpkins (Collaborator, Author):

New benchmarks with conversion to a byte check from char check (#172 (comment)):

test tests::benchmark_bytes_contains_with_cr           ... bench:       2,119 ns/iter (+/- 374)
test tests::benchmark_bytes_contains_with_cr_last_line ... bench:       1,381 ns/iter (+/- 78)
test tests::benchmark_bytes_contains_without_cr        ... bench:         257 ns/iter (+/- 26)

test tests::benchmark_contains_with_cr                 ... bench:       2,201 ns/iter (+/- 200)
test tests::benchmark_contains_with_cr_last_line       ... bench:       1,392 ns/iter (+/- 107)
test tests::benchmark_contains_without_cr              ... bench:         256 ns/iter (+/- 46)

test tests::benchmark_no_contains_with_cr              ... bench:       2,130 ns/iter (+/- 123)
test tests::benchmark_no_contains_with_cr_last_line    ... bench:       1,200 ns/iter (+/- 73)
test tests::benchmark_no_contains_without_cr           ... bench:       1,029 ns/iter (+/- 145)

Appears to have the same performance within the margin of error.

@madig (Collaborator) commented Sep 9, 2021

@chrissimpkins can you please clarify how I need to read your benchmark results?

@chrissimpkins (Collaborator, Author) commented Sep 9, 2021

@chrissimpkins can you please clarify how I need to read your benchmark results?

#172 (comment) is the original run with a description. The latest tests added a third test group with the new approach to check for the carriage return byte rather than char.

I ran tests against test strings that included:

  • with_cr = carriage returns and line feeds on every line
  • with_cr_last_line = carriage return and line feed on the last line only (to test a full string traversal before the CR is identified), all other lines are line feed only
  • without_cr = no carriage returns, only uses line feeds

The test is whether the contains check before replacement is useful in any of the above circumstances and these tests were grouped as follows:

  • bytes_contains = input.as_bytes().contains(&b'\r') byte check
  • contains = input.contains('\r') char check
  • no_contains = no check for the carriage return, we just run the replacement across the entire string

Full repro source in the attached archive

benchy.zip
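[Editorial note: the three gated variants being compared could be written roughly as follows. This is a sketch; the function names are illustrative, and the actual bench harness is in the attached archive.]

```rust
// Gate the replacement on a char scan (decodes UTF-8 boundaries).
fn replace_with_char_check(s: &str) -> String {
    if s.contains('\r') { s.replace('\r', "") } else { s.to_string() }
}

// Gate the replacement on a byte scan, avoiding char decoding.
fn replace_with_byte_check(s: &str) -> String {
    if s.as_bytes().contains(&b'\r') { s.replace('\r', "") } else { s.to_string() }
}

// No gate: always run the replacement over the whole string.
fn replace_unconditionally(s: &str) -> String {
    s.replace('\r', "")
}
```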

@chrissimpkins (Collaborator, Author) commented Sep 9, 2021

IMO the tl;dr takeaway is that, in by far the most common case of fea files with line feeds only, there is an improvement in execution time if you do gate the string replacement on a CR check (~250 ns/iter vs. ~1000 ns/iter). Char and byte checks appear to run in roughly the same time.

All tested on MBP 2017 15" macOS 10.15.7 with current rustc nightly

@chrissimpkins (Collaborator, Author):

We are dealing with very short times and at most one string per UFO master, so this is not going to break anything. Happy to remove the check if anyone feels strongly about more concise/clean source. It was not intuitive to me that this would be the finding.

@cmyr (Member) left a comment:

Looks good, thanks for measuring!

@chrissimpkins chrissimpkins merged commit e01c204 into linebender:master Sep 9, 2021
@chrissimpkins chrissimpkins deleted the line-endings branch September 9, 2021 14:33
@chrissimpkins (Collaborator, Author) commented Sep 10, 2021

Came across this in the Adobe OT fea spec documentation:

You can have multiple line endings, spaces, and tabs between tokens. Macintosh, UNIX and PC line endings are all supported.

Recording it here for posterity

Successfully merging this pull request may close these issues:

  • Cross-platform line ending serialization approach