
Serialize consistent cross-platform line endings across plist, glif, and feature files #172

Merged: 7 commits merged into linebender:master from the line-endings branch on Sep 9, 2021

Conversation

@chrissimpkins (Collaborator) commented Aug 26, 2021

Supersedes #170

Based on discussion in #162, we will write line feeds in all UFO files on all platforms by default.

This branch is based on #171 (edit: #171 now merged into master branch)


TODO

Confirm that line feeds are serialized by default in plist, glif, and feature files

@chrissimpkins (Collaborator, Author) commented Aug 26, 2021

I'm not sure what was going on with the previous tests. Either the expected test file strings had incorrectly formatted line endings, or fs::read was converting line endings to platform defaults. In any case, in the new tests here, plist and glif files serialize with \n line endings by default across platforms, so no change was required to make that the default line ending approach. This PR adds carriage return removal normalization for feature files on serialization, which appears to be the only change required to get to \n line ending serialization across all UFO file types on all platforms. glif, plist, and fea tests are added, including a new test UFO dir that has all files formatted with \r\n line endings, to confirm that they are changed during a read/write round trip.

Ready for review

cc @khaledhosny

src/font.rs Outdated
// Normalize feature files with line feed line endings
// This is consistent with the line endings serialized in glif and plist files
if features.contains('\r') {
fs::write(path.join(FEATURES_FILE), features.replace("\r", ""))?;

(Reviewer):
I suppose no one uses old mac line endings anymore, but in case someone does this would break the file, no?

@chrissimpkins (Collaborator, Author):

I had the same thought last night and had the change ready to push but held off based on your ‘it is 2021’ mantra. :)

Can push it if you think that anything out there could possibly write carriage returns only.

(Reviewer):

The last time I saw \r used as a line ending character was in VOLT (which is a Windows tool written by MS, go figure), so it is not relevant here. But even though it is 2021, Postel's Law might still apply to this case.

@chrissimpkins (Collaborator, Author):

Sgtm 234287e

(Collaborator):

I think VTT table dumps (in fontTools' TTX format) contain \r, but as the two literal characters, not \u000D. I think the data folder shouldn't be touched anyway.

(Collaborator):

Oh wait. It actually contains a literal \u000D, see https://github.com/daltonmaag/ubuntu/blob/master/source/Ubuntu-B.ufo/data/com.github.fonttools.ttx/T_S_I__1.ttx. Maybe because VTT started life on very old Macs that still used \r?

I suppose mixing LF and CRLF in any part of the UFO (glyph lib or elsewhere) is a bad idea so I guess this PR is fine, though.

@chrissimpkins (Collaborator, Author):

The data will be read in as byte arrays?

Maybe because VTT started life on very old Macs that still used \r?

I did not know this!

@chrissimpkins (Collaborator, Author) commented Aug 26, 2021

I benchmarked the String::contains check because I was concerned that, in the worst case, there would be two full document char traversals to achieve the replacement. I think it is worth keeping: this will run over every fea file at serialization, and the most common case is a file with no carriage returns, where the code runs significantly faster with the contains check than without it.

Here are the data, where contains = includes the contains check and no contains = running the string replacement over the string without the check:

File without carriage returns (most common case)

test tests::benchmark_contains_without_cr           ... bench:         261 ns/iter (+/- 20)
test tests::benchmark_no_contains_without_cr        ... bench:       1,458 ns/iter (+/- 104)

File with carriage return + line feed on the last line only (worst case, but highly unlikely)

test tests::benchmark_contains_with_cr_last_line    ... bench:       1,765 ns/iter (+/- 117)
test tests::benchmark_no_contains_with_cr_last_line ... bench:       1,635 ns/iter (+/- 124)

File with carriage return + line feed on all lines (most likely case when carriage returns are used)

test tests::benchmark_contains_with_cr              ... bench:       2,955 ns/iter (+/- 212)
test tests::benchmark_no_contains_with_cr           ... bench:       2,962 ns/iter (+/- 241)

Notably, the performance improvement is contingent on using a char in the contains check. If you convert it to a string slice, there is a significant performance impact associated with the contains check in the worst case scenario (roughly two-fold slower than not using the check). The overall impact is clearly low: this is one fea String per UFO master, with differences in the microsecond range.

chrissimpkins added a commit to chrissimpkins/norad that referenced this pull request Aug 26, 2021
@chrissimpkins chrissimpkins changed the title Serialize consistent cross-platform line endings Serialize consistent cross-platform line endings across plist, glif, and feature files Aug 26, 2021
@chrissimpkins chrissimpkins linked an issue Aug 26, 2021 that may be closed by this pull request
@chrissimpkins chrissimpkins mentioned this pull request Aug 26, 2021
chrissimpkins added a commit to chrissimpkins/norad that referenced this pull request Aug 31, 2021
@chrissimpkins (Collaborator, Author):

Thoughts about this approach @cmyr @madig?

@cmyr (Member) commented Sep 2, 2021

Sorry to let this sit, my attention has been elsewhere. I'm off the computer shortly for some irl appointments but I'll try to take a look tonight or tomorrow!

@cmyr cmyr self-requested a review September 2, 2021 16:02
@cmyr (Member) left a comment:

This works for me!

src/font.rs Outdated
fs::write(path.join(FEATURES_FILE), features)?;
// Normalize feature files with line feed line endings
// This is consistent with the line endings serialized in glif and plist files
if features.contains('\r') {
@cmyr (Member):

just a thought, and maybe the compiler is smart enough to optimize this, but I would prefer to scan for the \r byte over the \r char; scanning chars means decoding utf-8, whereas scanning bytes doesn't.

@cmyr (Member):

My only other thought (and this is totally unnecessary) is that if we really wanted to be perf nerds, the better approach would probably be to do this replacement when reading, instead of when writing.

Basically:

  • instead of something like fs::read_to_string, manually open the fea as a File
  • allocate two buffers
  • read chunks from the file into the first buffer. on each read, scan the chunk for \r
  • if \r is encountered, then manually recopy the lines into the second buffer, skipping an \r
  • if \r is never encountered, just return the first buffer untouched as the result, else return the second buffer.

this is dumb and definitely not worth it. 🤷
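[Editorial note: the chunked-read idea above could be sketched as follows. This is illustrative only and slightly simplified versus the comment: it always copies into a single output buffer rather than returning the read buffer untouched, and the chunk size is arbitrary.]

```rust
use std::io::{self, Read};

/// Read all of `reader` in chunks, stripping every `\r` byte on the way in.
/// Sketch of the approach described above; not norad's actual implementation.
fn read_stripping_cr<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = [0u8; 8192]; // arbitrary chunk size
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        let chunk = &buf[..n];
        if chunk.contains(&b'\r') {
            // Recopy the chunk, skipping carriage returns.
            out.extend(chunk.iter().copied().filter(|&b| b != b'\r'));
        } else {
            out.extend_from_slice(chunk);
        }
    }
    Ok(out)
}
```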

@chrissimpkins (Collaborator, Author):

scanning chars means decoding utf-8

I was wondering about the read approach when I worked on this. It looks like the feature read is to String so the UTF-8 check happens read side.

fn load_features(features_path: &Path) -> Result<String, Error> {
    let features = fs::read_to_string(features_path)?;
    Ok(features)
}

Does the feature file data need to be a UTF-8 vetted String? On the one hand, I suppose that it provides a built-in encoding linting check at read time. On the other, it may not be necessary unless there are String'y things that will need to be done with it in the library. Right now it appears that feature data are exposed as is to users in a String without other feature support.

@chrissimpkins (Collaborator, Author):

I would prefer to scan for the \r byte over the \r char

Are you referring to something along the lines of searching for b"\r" in features.as_bytes()?

@cmyr (Member):

Exactly.

@cmyr (Member):

scanning chars means decoding utf-8

what I mean is simply that the chars() iter has to determine character boundaries in the underlying bytes, which has some overhead. This is on top of the overhead of validating the utf-8 when it's first read.

I was wondering about the read approach when I worked on this. It looks like the feature read is to String so the UTF-8 check happens read side.

Yes, I believe we read directly to string currently, which handles validation. We could alternatively read to bytes and then validate ourselves, to equal effect.

fn load_features(features_path: &Path) -> Result<String, Error> {
    let features = fs::read_to_string(features_path)?;
    Ok(features)
}

Does the feature file data need to be a UTF-8 vetted String? On the one hand, I suppose that it provides a built-in encoding linting check at read time. On the other, it may not be necessary unless there are String'y things that will need to be done with it in the library. Right now it appears that feature data are exposed as is to users in a String without other feature support.

This is a good question, which is not answered by the spec. The spec does mention that strings are in utf-8, and I would prefer to assume that the whole file is, as well. It might be worth opening an issue in the Adobe repo...
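[Editorial note: the read-to-bytes-then-validate alternative mentioned above could look roughly like this. It is a hypothetical sketch with the same effect as fs::read_to_string; the function name is illustrative and the error type is simplified to Box<dyn Error> rather than norad's own Error.]

```rust
use std::{fs, path::Path};

// Sketch: read raw bytes first, then validate UTF-8 explicitly, so the
// validation step is visible rather than hidden inside read_to_string.
fn load_features_from_bytes(path: &Path) -> Result<String, Box<dyn std::error::Error>> {
    let bytes = fs::read(path)?;
    let features = String::from_utf8(bytes)?; // fails on invalid UTF-8
    Ok(features)
}
```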

@chrissimpkins (Collaborator, Author):

converted to a byte check in 2e84bce

@chrissimpkins (Collaborator, Author):

This is a good question, which is not answered by the spec. The spec does mention that strings are in utf-8, and I would prefer to assume that the whole file is, as well. It might be worth opening an issue in the Adobe repo...

Response from Josh Hadley (Adobe):

Yes, FEA files should indeed use UTF-8 encoding throughout. This has been an unwritten (well…partially written) assumption for some time now but as of the latest AFDKO (3.7.1), that assumption is enforced by the new Antlr4-based parser in makeotfexe. Older versions of makeotfexe might have been lax on that point, but that will no longer be the case moving forward. Also worth noting that makeotf (the Python interface to makeotfexe) has effectively enforced this for a while already.

But as you note, the OpenType Feature File Specification does not actually explicitly state that the whole file should be UTF-8. We will clarify that in an update soon.

@chrissimpkins (Collaborator, Author):

New benchmarks with conversion to a byte check from char check (#172 (comment)):

test tests::benchmark_bytes_contains_with_cr           ... bench:       2,119 ns/iter (+/- 374)
test tests::benchmark_bytes_contains_with_cr_last_line ... bench:       1,381 ns/iter (+/- 78)
test tests::benchmark_bytes_contains_without_cr        ... bench:         257 ns/iter (+/- 26)

test tests::benchmark_contains_with_cr                 ... bench:       2,201 ns/iter (+/- 200)
test tests::benchmark_contains_with_cr_last_line       ... bench:       1,392 ns/iter (+/- 107)
test tests::benchmark_contains_without_cr              ... bench:         256 ns/iter (+/- 46)

test tests::benchmark_no_contains_with_cr              ... bench:       2,130 ns/iter (+/- 123)
test tests::benchmark_no_contains_with_cr_last_line    ... bench:       1,200 ns/iter (+/- 73)
test tests::benchmark_no_contains_without_cr           ... bench:       1,029 ns/iter (+/- 145)

Appears to have the same performance within the margin of error.

@madig (Collaborator) commented Sep 9, 2021

@chrissimpkins can you please clarify how I need to read your benchmark results?

@chrissimpkins (Collaborator, Author) commented Sep 9, 2021

@chrissimpkins can you please clarify how I need to read your benchmark results?

#172 (comment) is the original run with a description. The latest tests added a third test group with the new approach to check for the carriage return byte rather than char.

I ran tests against test strings that included:

  • with_cr = carriage returns and line feeds on every line
  • with_cr_last_line = carriage return and line feed on the last line only (to test a full string traversal before the CR is identified), all other lines are line feed only
  • without_cr = no carriage returns, only uses line feeds

The test is whether the contains check before replacement is useful in any of the above circumstances and these tests were grouped as follows:

  • bytes_contains = input.as_bytes().contains(&b'\r') byte check
  • contains = input.contains('\r') char check
  • no_contains = no check for the carriage return, we just run the replacement across the entire string

Full repro source in the attached archive

benchy.zip
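[Editorial note: the three gated variants being compared could be written roughly as follows. This is a sketch; the function names are illustrative, and the actual bench harness is in the attached archive.]

```rust
// Gate the replacement on a char scan (decodes UTF-8 boundaries).
fn replace_with_char_check(s: &str) -> String {
    if s.contains('\r') { s.replace('\r', "") } else { s.to_string() }
}

// Gate the replacement on a byte scan, avoiding char decoding.
fn replace_with_byte_check(s: &str) -> String {
    if s.as_bytes().contains(&b'\r') { s.replace('\r', "") } else { s.to_string() }
}

// No gate: always run the replacement over the whole string.
fn replace_unconditionally(s: &str) -> String {
    s.replace('\r', "")
}
```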

@chrissimpkins (Collaborator, Author) commented Sep 9, 2021

IMO the tl;dr takeaway is that, in by far the most common case of fea files with line feeds only, there is an improvement in execution time if you do gate the string replacement on a CR check (~250 ns/iter vs. ~1000 ns/iter). Char and byte checks appear to run in roughly the same time.

All tested on MBP 2017 15" macOS 10.15.7 with current rustc nightly

@chrissimpkins (Collaborator, Author):

We are dealing with very short times and at most one string per UFO master, so this is not going to break anything. Happy to remove the check if anyone feels strongly about more concise/clean source. It was not intuitive to me that this would be the finding.

@cmyr (Member) left a comment:

Looks good, thanks for measuring!

@chrissimpkins chrissimpkins merged commit e01c204 into linebender:master Sep 9, 2021
@chrissimpkins chrissimpkins deleted the line-endings branch September 9, 2021 14:33
@chrissimpkins (Collaborator, Author) commented Sep 10, 2021

Came across this in the Adobe OT fea spec documentation:

You can have multiple line endings, spaces, and tabs between tokens. Macintosh, UNIX and PC line endings are all supported.

Recording it here for posterity

Successfully merging this pull request may close these issues:

  • Cross-platform line ending serialization approach