-
Notifications
You must be signed in to change notification settings - Fork 499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid ctype abuse. #2341
Avoid ctype abuse. #2341
Conversation
beb7ca7
to
789d306
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @riastradh, very much appreciated!
Would you be willing to also add tests for the changes? I think that tests would help explain the issue(s) with the code prior to this PR and avoid regressions in the future.
@@ -1034,7 +1035,7 @@ bool amount_t::parse(std::istream& in, const parse_flags_t& flags) | |||
commodity_t::parse_symbol(in, symbol); | |||
|
|||
if (! in.eof() && ((n = static_cast<char>(in.peek())) != '\n')) { | |||
if (std::isspace(static_cast<char>(in.peek()))) | |||
if (std::isspace(in.peek())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide more context please, as to why the static_cast
should be removed here, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose we're on a platform where char is signed and EOF is -1.
in.peek() returns an int, and the set of values that it can return is: {-1, 0, 1, 2, 3, 4, ..., 254, 255}. (That's 257 distinct possible values.) This set of values is always valid for std::isspace -- they're designed to be used together like this! -- so the cast is unnecessary.
What happens if we cast anyway?
Of that set, values in the subset {128, 129, 130, ..., 254, 255} are mapped by static_cast<char> to {-128, -127, -126, ..., -2, -1}.
Of that set, only -1 (coming from 255) is a valid input to std::isspace, because it coincides with the value of EOF, but that's probably not what is intended. (It might work by accident here, but if we were asking std::isalpha in an ISO-8859-1 locale, it should return true for 255 but false for EOF.)
All the other possible values 128 and over that in.peek() can return lead to passing values in {-128, -127, ..., -3, -2} to std::isspace, which is undefined behaviour.
So not only is the static_cast<char> here unnecessary, but it's actually harmful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's helpful, thanks! Do you know any platforms where EOF is -1? Once I again, I'd like to kindly ask for tests that catch ledger's current misbehaviour and ensures regressions to this fix are caught. Would you be willing to do that? If there are questions on how to write tests for ledger, I'm happy to help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know any platforms where EOF is -1?
I don't know any platforms where EOF isn't -1. In principle such platforms might exist and still conform to the C standard. But it's -1 on all the BSDs, on glibc, and on macOS.
*r++ = static_cast<char>(std::tolower( | ||
static_cast<unsigned char>(*q))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is first casting to unsigned char
and then casting to char
intentional here? If yes, what are the implications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is intentional because the only change I made was to replace std::tolower(*q) by std::tolower(static_cast<unsigned char>(*q)); I preserved the context.
It may be that you can safely remove the static_cast<char>(...) around it. EOF is not a possible output of tolower here, because EOF is not a possible input coming via static_cast<unsigned char>(...), so there's no risk of losing information from a non-injective conversion of int to char. But it might trigger warnings about implicit signedness conversions.
I left it as is to make auditing this change easier: keeping the outer static_cast<char> can't make the code worse. If you want to remove it, that's probably fine; I just felt it would be better to defer to a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to reply in detail, @riastradh. I agree with your reasoning.
It would be nice to make sure all of these cases are exercised, and normally for bug fixes like this I would prefer to add an xfail test in one commit and fix the bug in another commit. However, the symptom of the undefined behaviour is nondeterministic: you're reading out whatever random data happens to precede the ctype/tolower/toupper tables in memory. So it might sometimes return the right answers, and might sometimes return the wrong ones. Or it might crash. Or demons might fly out of your nose. But I don't think the ledger test harness has a way to verify nasal demons. That said: the tests have been failing on NetBSD for a while, and I just never got around to tracking down the cause. It turns out that this change together with one other fix (for a GCC libstdc++ bug, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114879, which breaks std::ios::sync_with_stdio(false) on NetBSD, https://gnats.NetBSD.org/58206) resolves all the test failures on my NetBSD system. I'm also considering committing changes to NetBSD to guarantee that ctype abuse triggers immediate SIGSEGV, rather than silent data corruption (https://gnats.NetBSD.org/58208). So while the existing ledger tests might not exercise all possible ctype abuse (and there's a lot -- there are many calls to ctype functions in ledger), they do seem to exercise enough to raise an issue. |
Ping? Is there any chance of merging this? The bug (#2340) is causing actual problems in my real-world use of ledger to maintain records with Chinese and Japanese names (#2338), so I have to carry around the patch and I would like to avoid carrying it around. The existing automatic tests already catch many instances of the problem on NetBSD. |
I'd love to merge it, but there is a CI build failure yet to be addressed. |
789d306
to
d87dee4
Compare
Thanks! I checked the build failure at https://github.com/ledger/ledger/actions/runs/8862348541/job/24350665848?pr=2341 and all I found was:
The build appears to have taken 1s, so I don't think it actually got to any part of the build that this change affected -- more likely a transient failure in github actions. So I updated the patch and pushed again, but it looks like it requires maintainer approval to rerun the build. Can you push that button? |
Thanks, looks like it was, as I guessed, a transient github actions failure -- all checks passed this time around. |
fix #2338
fix #2340