-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid ctype abuse. #2341
base: master
Are you sure you want to change the base?
Avoid ctype abuse. #2341
Conversation
9aa4116
to
beb7ca7
Compare
beb7ca7
to
789d306
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @riastradh, very much appreciated!
Would you be willing to also add tests for the changes? I think that tests would help explain the issue(s) with the code prior to this PR and avoid regressions in the future.
@@ -1034,7 +1035,7 @@ bool amount_t::parse(std::istream& in, const parse_flags_t& flags) | |||
commodity_t::parse_symbol(in, symbol); | |||
|
|||
if (! in.eof() && ((n = static_cast<char>(in.peek())) != '\n')) { | |||
if (std::isspace(static_cast<char>(in.peek()))) | |||
if (std::isspace(in.peek())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide more context please, as to why the static_cast
should be removed here, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose we're on a platform where char is signed and EOF is -1.
in.peek() returns an int, and the set of values that it can return is: {-1, 0, 1, 2, 3, 4, ..., 254, 255}. (That's 257 distinct possible values.) This set of values is always valid for std::isspace -- they're designed to be used together like this! -- so the cast is unnecessary.
What happens if we cast anyway?
Of that set, values in the subset {128, 129, 130, ..., 254, 255} are mapped by static_cast<char> to {-128, -127, -126, ..., -2, -1}.
Of that set, only -1 (coming from 255) is a valid input to std::isspace, because it coincides with the value of EOF, but that's probably not what is intended. (It might work by accident here, but if we were asking std::isalpha in an ISO-8859-1 locale, it should return true for 255 but false for EOF.)
All the other possible values 128 and over that in.peek() can return lead to passing values in {-128, -127, ..., -3, -2} to std::isspace, which is undefined behaviour.
So not only is the static_cast<char> here unnecessary, but it's actually harmful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's helpful, thanks! Do you know any platforms where EOF is -1? Once I again, I'd like to kindly ask for tests that catch ledger's current misbehaviour and ensures regressions to this fix are caught. Would you be willing to do that? If there are questions on how to write tests for ledger, I'm happy to help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know any platforms where EOF is -1?
I don't know any platforms where EOF isn't -1. In principle such platforms might exist and still conform to the C standard. But it's -1 on all the BSDs, on glibc, and on macOS.
*r++ = static_cast<char>(std::tolower( | ||
static_cast<unsigned char>(*q))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is first casting to unsigned char
and then casting to char
intentional here? If yes, what are the implications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is intentional because the only change I made was to replace std::tolower(*q) by std::tolower(static_cast<unsigned char>(*q)); I preserved the context.
It may be that you can safely remove the static_cast<char>(...) around it. EOF is not a possible output of tolower here, because EOF is not a possible input coming via static_cast<unsigned char>(...), so there's no risk of losing information from a non-injective conversion of int to char. But it might trigger warnings about implicit signedness conversions.
I left it as is to make auditing this change easier: keeping the outer static_cast<char> can't make the code worse. If you want to remove it, that's probably fine; I just felt it would be better to defer to a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to reply in detail, @riastradh. I agree with your reasoning.
It would be nice to make sure all of these cases are exercised, and normally for bug fixes like this I would prefer to add an xfail test in one commit and fix the bug in another commit. However, the symptom of the undefined behaviour is nondeterministic: you're reading out whatever random data happens to precede the ctype/tolower/toupper tables in memory. So it might sometimes return the right answers, and might sometimes return the wrong ones. Or it might crash. Or demons might fly out of your nose. But I don't think the ledger test harness has a way to verify nasal demons. That said: the tests have been failing on NetBSD for a while, and I just never got around to tracking down the cause. It turns out that this change together with one other fix (for a GCC libstdc++ bug, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114879, which breaks std::ios::sync_with_stdio(false) on NetBSD, https://gnats.NetBSD.org/58206) resolves all the test failures on my NetBSD system. I'm also considering committing changes to NetBSD to guarantee that ctype abuse triggers immediate SIGSEGV, rather than silent data corruption (https://gnats.NetBSD.org/58208). So while the existing ledger tests might not exercise all possible ctype abuse (and there's a lot -- there are many calls to ctype functions in ledger), they do seem to exercise enough to raise an issue. |
fix #2338
fix #2340