Provide a way to modify lines before submitting them #51

Closed
beaufour opened this Issue Mar 15, 2013 · 5 comments

It would be handy if you could pass log lines through a custom function before submitting them, for example to anonymize sensitive data such as PII.

Owner

troy commented Sep 3, 2013

@beaufour There's currently no way to do this, and I wanted to see how you would use it if it were available. My first thought is that the best bets are either:

  • Log that from the framework. For example, in Rails, add the lograge gem (or any other one that supports custom attribute logging), filter the full api_key parameter with parameter filtering, and then log your modified parameter instead.
    This seems like a great end result, in that it's flexible, totally customizable (for the right level of anonymity on an attribute-specific basis), and no harder to set up (probably easier) than the equivalent would be in remote_syslog.
  • Use an rsyslog or syslog-ng hack to rewrite it. Neither of these supports arbitrary rewrites all that well, but both can be made to do it.

beaufour commented Sep 4, 2013

@troy I was thinking along those lines too, but in an ideal (read: my :) ) world, I would want to keep my raw and unmodified log files as they are now on the local system. Then, only for the transfer to a third-party system, would I modify or strip out things like API keys, IP addresses, and other sensitive information, whether for security or regulatory purposes.

Owner

troy commented Sep 4, 2013

Thanks, @beaufour. This inspired me to write down why I think that's a bad approach :-) There's nothing inherently more or less secure about transferring logs to another system - whether your own log aggregator, a nightly batch copy (like with rsync), a hosted logging service, a hosted file storage service like S3, or a long-term reporting service. Assuming there's trust attached to these keys, either the logging entity is comfortable accumulating that trust - and basically creating a central repository for it - or it's not.

The rest should be based on the specifics of the situation (security of that solution and the degree, amount, and duration of trust), not where it is or who runs it.

My own take on that is that it's never worth accumulating a constantly-increasing pile of trust. For example, 80% of an API token is almost always unique enough to easily discern the requestor, yet incomplete enough to make the logged value basically useless as a credential. (NIST has a decent doc on tactics for this.)

But regardless of how much risk/exposure one is willing to accept, I think differentiating between local and remote, or decentralized and centralized, or locally-operated and hosted, is the wrong approach. To boil it down, either getting hacked and having those logs exposed is awful but tolerable, or it's not and they shouldn't be logged (or at least, retained). Sooner or later, that will happen.

I think an expectation that any system can, and with enough time, will, get hacked is also a good litmus test for what one actually cares about stripping. In most cases, IPs aren't worth the effort. In 99.9% of situations, if those got out, it would be annoying but not really impact anything. Same for arbitrary user IDs (as in 1843243, not emails) and the like. OTOH, API keys and maybe email addresses seem totally worth some de-identification, depending on whose they are and how tough they are to expire.

beaufour commented Sep 4, 2013

I do see your viewpoint, but disagree on most of it. A short use case could be: keeping IPs and other things local on the system for a short period of time (max 1-2 days) so if something happens you can diagnose it. That doesn't mean that I ever want it to leave the system. On anonymizing the data, you could cut off the last two digits of an IP, or one-way hash the API key, etc. Plenty of good ways.
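For illustration only, the two tactics mentioned above could be sketched in Python roughly as follows. This is not part of remote_syslog. The `api_key=<token>` log format, the `SECRET` value, and the function names are all assumptions, and "cutting off the last digits" of an IP is interpreted here as zeroing the last IPv4 octet.

```python
import hashlib
import hmac
import re

# Hypothetical secret for the keyed hash; it would normally come from config.
SECRET = b"replace-me"

# Loose IPv4 matcher: four dot-separated groups of 1-3 digits.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")

# Assumed log format: tokens appear as api_key=<alphanumeric token>.
API_KEY = re.compile(r"(api_key=)([A-Za-z0-9]+)")


def mask_ip(line):
    """Replace the last octet of every IPv4 address with 0."""
    return IPV4.sub(r"\1.\2.\3.0", line)


def hash_api_key(line, secret=SECRET):
    """Replace each API key with a truncated HMAC-SHA256 of the key."""
    def repl(match):
        digest = hmac.new(secret, match.group(2).encode(), hashlib.sha256)
        # 12 hex characters: enough to correlate requests, not to authenticate.
        return match.group(1) + digest.hexdigest()[:12]
    return API_KEY.sub(repl, line)


def scrub(line):
    """Apply both anonymization steps to a single log line."""
    return hash_api_key(mask_ip(line))
```

A keyed hash (HMAC) is used instead of a bare hash so that someone holding the scrubbed logs cannot confirm a guessed key without also knowing the secret. The same key always maps to the same value, so requests from one client can still be correlated.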

I also wholeheartedly disagree that transferring data doesn't change the security. I don't trust all systems equally, neither externally nor internally.

But thanks for the very thorough response on this!

troy was assigned Sep 4, 2013

Owner

troy commented Sep 17, 2013

Hi @beaufour,

You're very welcome for my thought process here. It's how we work, and I appreciate your comments. I think what you're describing is going to be a better fit for rsyslog, syslog-ng, or an intermediate post-processor/re-writer. remote_syslog is pretty lean and mean, or tries to be.
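As a rough sketch of what an intermediate post-processor/re-writer could look like (an assumption for illustration, not something remote_syslog or the tools named above provide), a small Python filter can copy lines from one stream to another and pass each through a custom function. The names `rewrite_stream` and `redact_example` are hypothetical.

```python
def rewrite_stream(source, sink, transform):
    """Copy lines from source to sink, passing each one through transform.

    transform takes a line without its trailing newline and returns the
    rewritten line, or None to drop the line entirely.
    """
    for raw in source:
        rewritten = transform(raw.rstrip("\n"))
        if rewritten is not None:
            sink.write(rewritten + "\n")
            sink.flush()  # keep the output current for whatever tails it


def redact_example(line):
    """Placeholder transform: hide a literal marker and drop debug lines."""
    if line.startswith("DEBUG"):
        return None
    return line.replace("secret", "[redacted]")
```

Called as `rewrite_stream(sys.stdin, sys.stdout, redact_example)`, this could sit at the end of a pipe that follows the raw log file and writes a scrubbed copy. The raw file stays untouched on the local system, and only the scrubbed copy is handed to the log shipper.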

Also, one tiny clarification. I don't trust all systems equally, I just wouldn't inherently trust one more or less because it's local (or on another machine, or in another country). The specifics are everything.

Thanks,

Troy

troy closed this Sep 17, 2013
