It would be handy if you could pass log lines through a custom function before submitting them, for example to anonymize sensitive data such as PII.
@beaufour There's currently no way to do this, and I wanted to see how you wanted to use it (or would if this was available). My first thought is that the best bets are either:
@troy I was thinking along those lines too, but in an (read: my :) ) ideal world, I would want to keep my raw and unmodified log files as they are now on the local system. Then only for the transfer to a third party system would I modify or strip out stuff like api keys, ip addresses, and other sensitive information. Either for security or regulatory purposes.
Thanks, @beaufour. This inspired me to write down why I think that's a bad approach :-) There's nothing inherently more or less secure about transferring logs to another system - whether your own log aggregator, a nightly batch copy (like with rsync), a hosted logging service, a hosted file storage service like S3, or a long-term reporting service. Assuming there's trust attached to these keys, either the logging entity is comfortable accumulating that trust - and basically creating a central repository for it - or it's not.
The rest should be based on the specifics of the situation (security of that solution and the degree, amount, and duration of trust), not where it is or who runs it.
My own take on that is that it's never worth accumulating a constantly-increasing pile of trust. For example, in almost every case, 80% of an API token is still unique enough to easily discern the requestor, yet incomplete enough to make the logged value basically useless to anyone who gets hold of it. (NIST has a decent doc on tactics for this.)
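To make the partial-token idea concrete, here is a minimal Python sketch. The function name `truncate_token`, the 80% default, and the `*` filler are all assumptions for illustration, not anything remote_syslog provides.

```python
def truncate_token(token: str, keep_percent: int = 80, filler: str = "*") -> str:
    """Keep the leading keep_percent of a token and mask the remainder.

    Hypothetical helper: the kept prefix is usually enough to tell
    requestors apart, while the full credential never reaches the log.
    """
    if not 0 <= keep_percent <= 100:
        raise ValueError("keep_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    keep = len(token) * keep_percent // 100
    return token[:keep] + filler * (len(token) - keep)
```

The idea would be to apply this at the point where the application writes the log line, so the complete token is never stored anywhere.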
But regardless of how much risk/exposure one is willing to accept, I think differentiating between local and remote, or decentralized and centralized, or locally-operated and hosted, is the wrong approach. To boil it down, either getting hacked and having those logs exposed is awful but tolerable, or it's not and they shouldn't be logged (or at least, retained). Sooner or later, that will happen.
I think an expectation that any system can, and with enough time, will, get hacked is also a good litmus test for what one actually cares about stripping. In most cases, IPs aren't worth the effort. In 99.9% of situations, if those got out, it would be annoying but not really impact anything. Same for arbitrary user IDs (as in 1843243, not emails) and the like. OTOH, API keys and maybe email addresses seem totally worth some de-identification, depending on whose they are and how tough they are to expire.
I do see your viewpoint, but disagree on most of it. A short use case could be: keeping IPs and other things local on the system for a short period of time (max 1-2 days) so if something happens you can diagnose it. That doesn't mean that I ever want it to leave the system. On anonymizing the data, you could cut off the last two digits of an IP, or one-way hash the API key, etc. There are plenty of good ways.
I also wholeheartedly disagree that transferring data doesn't change the security. I don't trust all systems equally, neither externally nor internally.
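A rough Python sketch of those two tactics, under some assumptions: "cutting off" part of an IP is interpreted here as zeroing the last octet of an IPv4 address, and the one-way hash is a keyed HMAC-SHA256 rather than a bare hash. The names `mask_ipv4` and `hash_secret` are hypothetical.

```python
import hashlib
import hmac
import re

# Matches a dotted-quad IPv4 address and captures the first three octets.
# Assumption: a loose pattern is good enough for log scrubbing.
IPV4_RE = re.compile(r"\b(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\b")


def mask_ipv4(line: str) -> str:
    """Replace the last octet of every IPv4 address in the line with 0."""
    return IPV4_RE.sub(r"\1.0", line)


def hash_secret(value: str, key: bytes) -> str:
    """Return a short keyed one-way digest of a secret such as an API key.

    The same input and key always give the same digest, so requests can
    still be correlated without exposing the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]
```

A keyed hash is used in this sketch because a plain hash of a guessable value can be reversed by brute force; whether that matters depends on how much entropy the original value has.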
But thanks for the very thorough response on this!
You're very welcome for my thought process here. It's how we work, and I appreciate your comments. I think what you're describing is going to be a better fit for rsyslog, syslog-ng, or an intermediate post-processor/re-writer. remote_syslog is pretty lean and mean, or tries to be.
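For what it's worth, an intermediate re-writer along those lines could be quite small. Below is a minimal Python sketch, assuming a line-oriented filter that sits between the raw log and whatever ships it. The rule patterns, the `api_key=` parameter name, and the function names are hypothetical and would need tuning for real log formats.

```python
import re
import sys

# Hypothetical scrub rules: (compiled pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"(api_key=)[^\s&]+"), r"\1[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
]


def scrub(line, rules=RULES):
    """Apply every rule to a single log line and return the result."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line


def scrub_stream(src, dst, rules=RULES):
    """Copy lines from src to dst, scrubbing each one on the way through."""
    for line in src:
        dst.write(scrub(line, rules))
        dst.flush()  # flush per line so a downstream forwarder sees it promptly


# To use as a pipe filter, call: scrub_stream(sys.stdin, sys.stdout)
```

The raw file stays untouched on the local system, and only the scrubbed copy is handed to the forwarder.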
Also, one tiny clarification: I don't trust all systems equally; I just wouldn't inherently trust one more or less because it's local (or on another machine, or in another country). The specifics are everything.