-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialize consistent cross-platform line endings across plist, glif, and feature files #172
Conversation
I'm not sure what was going on with the previous tests. Either expected test file strings had incorrectly formatted line endings or fs::read is converting line endings to different platform defaults? In any case, it appears that plist files and glif files serialize with Ready for review cc @khaledhosny |
src/font.rs
Outdated
// Normalize feature files with line feed line endings | ||
// This is consistent with the line endings serialized in glif and plist files | ||
if features.contains('\r') { | ||
fs::write(path.join(FEATURES_FILE), features.replace("\r", ""))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose no one uses old mac line endings anymore, but in case someone does this would break the file, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same thought last night and had the change ready to push but held off based on your ‘it is 2021’ mantra. :)
Can push it if you think that anything out there could possibly write carriage returns only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last time I have seen \r
used as line ending char it was VOLT (which is a Windows tool written by MS, go figure), so not relevant here. But even though it is 2021 Postel's Law might still apply to this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sgtm 234287e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think VTT table dumps (in fontTools' TTX format) contain \r
, but the two characters and not \u000D
. I think the data folder shouldn't be touched anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait. It actually contains literal \u000D
, see https://github.com/daltonmaag/ubuntu/blob/master/source/Ubuntu-B.ufo/data/com.github.fonttools.ttx/T_S_I__1.ttx. Maybe because VTT started life on very old Macs that still used \r
?
I suppose mixing LF and CRLF in any part of the UFO (glyph lib or elsewhere) is a bad idea so I guess this PR is fine, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data will be read in as byte arrays?
Maybe because VTT started life on very old Macs that still used \r?
I did not know this!
I benched the String::contains check because I was concerned that in the worst case scenario there would be two full document char traversals to achieve the replacement. I think that it is worth keeping because this will be run over every fea file at serialization and the most common case will be no carriage returns in the file. In that situation the code runs significantly faster with the contains char check than without it. Here are the data where contains=includes the contains check and no contains=running the string replacement over the string without the check: File without carriage returns (most common case)
File with carriage return + line feed on the last line only (worst case, but highly unlikely)
File with carriage return + line feed on all lines (most likely case when carriage returns are used)
Notably the performance improvement is contingent on using a char in the contains check. If you convert it to a string slice there is a significant performance impact a/w the contains check in the worst case scenario (~ 2-fold slower vs. not using the check). The overall impact is clearly low. This is one file/fea String per UFO master with differences in the microsecond range. |
6548bb3
to
234287e
Compare
Sorry to let this sit, my attention has been elsewhere. I'm off the computer shortly for some irl appointments but I'll try to take a look tonight or tomorrow! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for me!
src/font.rs
Outdated
fs::write(path.join(FEATURES_FILE), features)?; | ||
// Normalize feature files with line feed line endings | ||
// This is consistent with the line endings serialized in glif and plist files | ||
if features.contains('\r') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a thought, and maybe the compiler is smart enough to optimize this, but I would prefer to scan for the \r
byte over the \r
char; scanning chars means decoding utf-8, whereas scanning bytes doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only other thought (and this is totally unnecessary) is that if we really wanted to be perf nerds, the better approach would probably be to do this replacement when reading, instead of when writing.
Basically:
- instead of something like
fs::read_to_string
, manually open the fea as aFile
- allocate two buffers
- read chunks from the file into the first buffer. on each read, scan the chunk for
\r
- if
\r
is encountered, then manually recopy the lines into the second buffer, skipping an\r
- if
\r
is never encountered, just return the first buffer untouched as the result, else return the second buffer.
this is dumb and definitely not worth it. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scanning chars means decoding utf-8
I was wondering about the read approach when I worked on this. It looks like the feature read is to String so the UTF-8 check happens read side.
fn load_features(features_path: &Path) -> Result<String, Error> {
let features = fs::read_to_string(features_path)?;
Ok(features)
}
Does the feature file data need to be a UTF-8 vetted String? On the one hand, I suppose that it provides a built-in encoding linting check at read time. On the other, it may not be necessary unless there are String'y things that will need to be done with it in the library. Right now it appears that feature data are exposed as is to users in a String without other feature support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to scan for the \r byte over the \r char
Are you referring to something along the lines of searching for b"\r"
in features.as_bytes()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scanning chars means decoding utf-8
what I mean is simply that the chars() iter has to determine character boundaries in the underlying bytes, which has some overhead. This is on top of the overhead of validating the utf-8 when it's first read.
I was wondering about the read approach when I worked on this. It looks like the feature read is to String so the UTF-8 check happens read side.
Yes, I believe we read directly to string currently, which handles validation. We could alternatively read to bytes and then validate ourselves, to equal effect.
fn load_features(features_path: &Path) -> Result<String, Error> { let features = fs::read_to_string(features_path)?; Ok(features) }Does the feature file data need to be a UTF-8 vetted String? On the one hand, I suppose that it provides a built-in encoding linting check at read time. On the other, it may not be necessary unless there are String'y things that will need to be done with it in the library. Right now it appears that feature data are exposed as is to users in a String without other feature support.
This is a good question, which is not answered by the spec. The spec does mention that strings are in utf-8, and I would prefer to assume that the whole file is, as well. It might be worth opening an issue in the Adobe repo...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
converted to a byte check in 2e84bce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question, which is not answered by the spec. The spec does mention that strings are in utf-8, and I would prefer to assume that the whole file is, as well. It might be worth opening an issue in the Adobe repo...
Response from Josh Hadley (Adobe):
Yes, FEA files should indeed use UTF-8 encoding throughout. This has been an unwritten (well…partially written) assumption for some time now but as of the latest AFDKO (3.7.1), that assumption is enforced by the new Antlr4-based parser in makeotfexe. Older versions of makeotfexe might have been lax on that point, but that will no longer be the case moving forward. Also worth noting that makeotf (the Python interface to makeotfexe) has effectively enforced this for a while already.
But as you note, the OpenType Feature File Specification does not actually explicitly state that the whole file should be UTF-8. We will clarify that in an update soon.
…dings fix test name update expected fea file format
all carriage returns are removed across all platforms
…e endings confirm that we convert these to \n in plist files on all platforms
234287e
to
231ab61
Compare
New benchmarks with conversion to a byte check from char check (#172 (comment)):
Appears to have the same performance within the margin of error. |
@chrissimpkins can you please clarify how I need to read your benchmark results? |
#172 (comment) is the original run with a description. The latest tests added a third test group with the new approach to check for the carriage return byte rather than char. I ran tests against test strings that included:
The test is whether the contains check before replacement is useful in any of the above circumstances and these tests were grouped as follows:
Full repro source in the attached archive |
IMO the tl;dr takeaway is that, in the vastly most common case of fea files with line feeds only, there is an improvement in execution time if you do gate the string replacement on a CR check (~250ns/iter vs. ~1000ns/iter). It appears that char and byte checks run in ~ the same time. All tested on MBP 2017 15" macOS 10.15.7 with current rustc nightly |
We are dealing with very short times and max one string per ufo master so this is not going to break anything. Happy to remove the check if anyone feels strongly about more concise/clean source. It was not intuitive to me that this would be the finding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for measuring!
Came across this in the Adobe OT fea spec documentation:
Recording it here for posterity |
Supersedes #170
Based on discussion in #162, we will write line feeds in all UFO files on all platforms by default.
This branch is based on #171 (edit: #171 now merged into master branch)
TODO
Confirm that line feeds are serialized by default in: