Various parts of ledger pass values of type char, derived from arbitrary user input, into the standard C++ functions std::isspace, std::isalpha, &c. (which are just copies of the standard C <ctype.h> functions). For example, at ledger/src/textual.cc, line 343 in c679e3c, line[...] has type char.
The ctype functions take inputs of type int. But when the input to the functions is not representable by type unsigned char -- that is, when the value represented by the object of type char is negative -- and the value is not the value of the integer constant macro EOF, this is undefined behaviour:
The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
For example, given char x[2] = {0xa2,0}, which represents a NUL-terminated string of CENT SIGN (¢) in the locale en_GB.ISO8859-1, on platforms where char is signed (such as the usual amd64 ABI) and EOF is -1, isspace(x[0]) is undefined behaviour because the value of x[0] is -94, which is neither the value of EOF nor representable by the type unsigned char. (It can be converted to unsigned char, giving the different number 162.)
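The two numbers in the example above can be checked with a small sketch. The helper names as_promoted, as_unsigned, and in_ctype_domain are hypothetical and exist only for illustration. The byte is written with static_cast<char> because a C++11 or later compiler may reject the braced initializer {0xa2,0} as a narrowing conversion where char is signed.

```cpp
#include <climits>

// Hypothetical helper: the int a ctype function receives when a char is
// passed directly. Integer promotion keeps the sign of a signed char.
int as_promoted(char c) { return c; }

// Hypothetical helper: the int a ctype function receives after the cast
// through unsigned char. The result is always in 0..UCHAR_MAX.
int as_unsigned(char c) { return static_cast<unsigned char>(c); }

// True if v is representable as unsigned char, i.e. a value (other than
// EOF) for which the ctype functions are defined.
bool in_ctype_domain(int v) { return v >= 0 && v <= UCHAR_MAX; }
```

With char c = static_cast<char>(0xa2) on a platform where char is signed and 8 bits wide, as_promoted(c) yields -94, which is outside the ctype domain, while as_unsigned(c) yields 162, which is inside it.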
The undefined behaviour may manifest as silent wrong answers, as in #2338, or as crashes if the virtual page preceding a ctype table is mapped unreadable. Even in cases where it is not undefined behaviour, the answer may be wrong. For example, given char x[2] = {0xff, 0}, which might represent a NUL-terminated string of LATIN SMALL LETTER Y WITH DIAERESIS (ÿ) in the locale fr_FR.ISO8859-1, isalpha(x[0]) will return the same result as isalpha(EOF) on most systems (where EOF is -1), giving false when the correct answer is true.
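The collision between the byte 0xff and EOF can be made concrete with a minimal sketch. The function names below are hypothetical, and the collision itself only occurs under the assumptions stated above (signed 8-bit char, EOF equal to -1).

```cpp
#include <climits>
#include <cstdio>   // EOF

// True on platforms where plain char is a signed type.
bool char_is_signed() { return CHAR_MIN < 0; }

// True if passing c directly to a ctype function would be
// indistinguishable from passing EOF.
bool collides_with_eof(char c) { return static_cast<int>(c) == EOF; }

// The same check after casting through unsigned char. EOF is negative and
// the cast value is nonnegative, so this never reports a collision.
bool collides_after_cast(char c) {
    return static_cast<int>(static_cast<unsigned char>(c)) == EOF;
}
```

Under those assumptions, collides_with_eof(static_cast<char>(0xff)) is true, which is why the uncast call cannot tell ÿ apart from end of file, whereas collides_after_cast is false for every char value.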
The reason for this peculiar interface is that the ctype functions were intended to handle the results of C functions like fgetc, or C++ functions like std::istream::peek, which return values of type int, which can be either EOF (typically -1) or a value that is representable in unsigned char (0, 1, 2, ..., 255):
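std::istream &in = ...;
int ch;
while ((ch = in.peek()) != EOF) {
    if (std::isspace(ch)) { ... }   // safe: ch is either EOF or in 0..255
    in.get();                       // consume the peeked character so the loop advances
}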
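A self-contained variant of the stream loop might look like the sketch below. It uses get() rather than peek() so that each iteration consumes a character, and it reads from a std::istringstream purely for demonstration. The names count_spaces and count_spaces_in are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts whitespace characters in a stream. get() returns an int that is
// either the end-of-file value or a character value already converted
// through unsigned char, so it can be passed to std::isspace unchanged.
std::size_t count_spaces(std::istream &in) {
    std::size_t n = 0;
    int ch;
    while ((ch = in.get()) != std::char_traits<char>::eof()) {
        if (std::isspace(ch)) ++n;
    }
    return n;
}

// Convenience wrapper for demonstration: run the loop over a string.
std::size_t count_spaces_in(const std::string &text) {
    std::istringstream in(text);
    return count_spaces(in);
}
```

No cast is needed here because the int never passes through the type char on its way to std::isspace.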
Using them to process elements of a char array or std::string generally requires casting the char to unsigned char first:
std::string s = ...;
if (std::isspace(s[i])) { ... } // almost always wrong
if (std::isspace(static_cast<unsigned char>(s[i]))) { ... } // guaranteed safe
Even in cases where an input of type char might be guaranteed, by the surrounding context, to lie in the range {0, 1, 2, ..., 127} instead of {-128, -127, ..., -1, 0, 1, 2, ..., 127}, casting to unsigned char first is (a) still safe and (b) easier to audit.
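One way to keep the cast auditable is to confine it to a few small wrappers, so that call sites never touch the ctype functions directly. This is only a sketch: is_space, is_alpha, and ltrim are hypothetical names, not functions that ledger is known to define.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical wrappers: take a char, cast through unsigned char, and
// return bool, so callers cannot forget the cast.
inline bool is_space(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
inline bool is_alpha(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Example call site: strip leading whitespace from a string.
std::string ltrim(const std::string &s) {
    auto it = std::find_if(s.begin(), s.end(),
                           [](char c) { return !is_space(c); });
    return std::string(it, s.end());
}
```

An audit then reduces to checking the wrappers once and searching for any remaining direct calls to std::isspace and friends that take a char.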
(Some uses of the ctype macros might be inappropriate anyway to process UTF-8-encoded input (or possibly other multibyte input, if ledger supports that). But, if that is an issue, it is a different issue that can be resolved separately. This issue is just about the abuse of ctype functions for inputs of type char.)
riastradh pushed a commit to riastradh/ledger that referenced this issue on Apr 27, 2024.