Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

easylogging++: sanitize log payload #6550

Merged
merged 1 commit into from
Jul 8, 2020

Conversation

moneromooo-monero
Copy link
Collaborator

Some of it might be coming from untrusted sources

Copy link
Contributor

@vtnerd vtnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started this review but then stopped. Why is this being done for every log message instead of the offending ones? This requires an additional string copy for every log message. This is infrequent (logging shouldn't spam too much), but how many times is some "unsanitized" (not printable?) message being logged? And what happens when the users locale isn't utf8?

}

template<typename T>
inline T utf8canonical(const T &s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overload is unnecessary due to the default argument.

{

template<typename T, typename Transform>
inline T utf8canonical(const T &s, Transform t = [](wint_t c)->wint_t { return c; })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this entire function repeated in this header, and the other cpp? Was it supposed to be removed from the other cpp?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because easylogging++ does not depend on common.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some independent thoughts:

  • Why have template type T? Each "copy" of the function uses a single type.
  • Using/instantiating templated code doesn't create a library dependency.
  • The mnemonics code already depends on epee and easylogging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the original used string and wipeable_string. I'll remove the template on that one.

Are you telling me "you could #include "mnemonics/something" or even "common/something" in easylogging ? That'd be gross.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, or the opposite - mnemonics could open something in easylogging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't really belong in a logging API though. If you feel strongly about it, I guess I can do it that way.

default: throw std::runtime_error("Invalid UTF-8");
}
*wptr = 0;
sc += T(wbuf, bytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use .append() here. The code already assumes a particular constructor and operator+= interface; append is usually going to be more efficient.

void sanitize(std::string &s)
{
s = utf8canonical(s, [](wint_t c)->wint_t {
if (c == 9 || c == 10 || c == 13)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you trying to do here? std::isprint ? There's also boost::spirit::char_encoding::unicode::isprint which works specifically with unicode points regardless of locale settings. There isn't a quick/guaranteed isprint for unicode in C++11.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it looks like this is filtering for printable ascii and only space characters? This check might be off for upper characters I'd have to look again.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping stuff we want to get in logs, but not control characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on whats being allowed?

@moneromooo-monero
Copy link
Collaborator Author

It avoids going through all the code to replace LOG(s) with LOG(sanitize_for_log(s)). Which is annoying, might miss some, and does not ensure future ones don't get added.

@moneromooo-monero
Copy link
Collaborator Author

I guess later utf8canonical could be changed to be two pass, first pass checks, second pass happens only if the first pass detected the string should change and makes the change.

{
std::string sc = "";
size_t avail = s.size();
const char *ptr = s.data();
Copy link
Contributor

@vtnerd vtnerd May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're going with the two copies? Ahhh ... well at least in-place overwrites is guaranteed to work since the filtering function returns either the same codepoint or a single-byte codepoint (?).

boost::locale::utf has most of this implemented, and the documentation states linking is not required (header only). This would make an in-place implementation easier to read I think, something like:

char* wptr = &str[0];
const char* rptr = wptr;
char const* const end = rptr + str.size();
while ( rptr != end)
{
  using utf8 = boost::locale::utf::utf_traits<char>;

  const bool copy = rptr != wptr;
  const auto cp = utf8::decode(rptr, end);
  if (!boost::locale::utf::is_valid_codepoint(cp))
     throw std::runtime_error{...};

  // only works if log_filter returns codepoint equal or
  // less in length (always true currently)
  const auto filtered = filter_for_log(cp);
  if (filtered != cp || copy)
    wptr = utf8::encode(filtered, wptr);
  else
    wptr = rptr;
}

The switch in the boost implementation is a bit more compact, and contains a few more checks for invalid utf8 sequences. Regardless, the in-place version is doable once the filtering function is static/not mutable.

@@ -2475,6 +2475,100 @@ void DefaultLogDispatchCallback::handle(const LogDispatchData* data) {
}
}


template<typename Transform>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Transform function is also fixed here since this is just a straight copy. Just move the lambda into its own function and call filter_codepoint or whatever from inside the function.

template<typename Transform>
static inline std::string utf8canonical(const std::string &s, Transform t = [](wint_t c)->wint_t { return c; })
{
std::string sc = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use default construction. It might skip std::strlen calls, but there isn't a reason to leave it to chance.

@@ -73,78 +74,11 @@ namespace Language
return prefix;
}

template<typename T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was moved during the initial refactor?

Currently the only new call to this function is using an entirely different copy of the same function. Switching to a templated Transform function also doesn't seem useful. The T template is also (has always been) fixed according to my quick grepping.

{

template<typename T, typename Transform>
inline T utf8canonical(const T &s, Transform t = [](wint_t c)->wint_t { return c; })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, or the opposite - mnemonics could open something in easylogging.

struct WordHash
{
std::size_t operator()(const epee::wipeable_string &s) const
{
const epee::wipeable_string sc = utf8canonical(s);
const epee::wipeable_string sc = tools::utf8canonical(s, [](wint_t c) -> wint_t { return std::tolower(c); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calls to std::tolower are (and have been previously) invalid since simple wallet changes to the user environment locale which may not be unicode compatible. Presumably, everyone has been using utf8 encodings in their locale, or this would've failed. A comment or Github issue? If someone hits this edge case, lots should go wrong in the code I think.

There's boost::spirit::char_encoding::unicode::tolower (from the same file mentioned previously), but that drags in unicode tables from Boost. I don't see any other easy method, with Boost or C++, as you have to specify some kind of correct locale which is funky and may not be present. Boost locale can convert the entire string, but it only returns std::string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

towlower might work ? I just saw it in the manpges, takes a wint_t, says "in the current locale", no mention of UTF-8.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

towlower is better but still has the same user environment locale/encoding issue. If the user environment uses shift_jis, then all of this is incorrect. Although, the wallet is probably unusable in that environment (i.e. scope larger than just this snippet), and no one has filed any bug reports so (nearly) everyone uses utf8 presumable.

I commented on it because the CLI output is probably going to be messed up in that situation, and it may not be immediately obvious why.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use towlower then. Whatever improvements can be done later if you find a good way (or boost).

void sanitize(std::string &s)
{
s = utf8canonical(s, [](wint_t c)->wint_t {
if (c == 9 || c == 10 || c == 13)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on whats being allowed?

@moneromooo-monero
Copy link
Collaborator Author

What kind of comment are you after ? Those just make sense as they're already logged.

@moneromooo-monero
Copy link
Collaborator Author

I'm just going to ignore the bikeshedding. If you want to change, the code to use boost or add micro optimisations, fine, but I'm not going to do it here.

@vtnerd
Copy link
Contributor

vtnerd commented May 18, 2020

I'm just going to ignore the bikeshedding. If you want to change, the code to use boost or add micro optimisations, fine, but I'm not going to do it here.

This isn't bikeshedding or micro-optimizing. The codebase routinely has large functions with many variables instead of smaller re-usable functions. Ideally we move in the other direction. And this PR duplicates entire functions within the same codebase. This PR is to the master branch, not the 0.16 variant - I'm not sure why there's a need to accept this code with some performance impact when an alternative implementation isn't too difficult.

This patch copies+allocates a new string for every log message, while a lock is being held. Log messages are rare, but the same lock must be acquired (basically) for all error/warning/info log statements to determine whether a message is to be displayed (its worse if the user bumps up logging in one category).

@moneromooo-monero
Copy link
Collaborator Author

This is not new code, in case you did not notice. It was just moved (the duplication is unfortunate, though I agree). Stuff like changing the ctor to take out a strlen on a "" is time wasting though.

Some of it might be coming from untrusted sources

Reported by itsunixiknowthis
@luigi1111 luigi1111 merged commit ee817e0 into monero-project:master Jul 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants