feat: Generic logline placeholder replacement and tokenization #12799

na-- · 2024-04-26T07:28:34Z

What this PR does / why we need it:

This code allows us to preprocess generic logs and replace highly variable dynamic data (timestamps, IPs, numbers, UUIDs, hex values, bytesizes and durations) with static placeholders for easier pattern extraction and more efficient and user-friendly matching by the Drain algorithm.

Additionally, there is logic that splits generic log lines into discrete tokens that can be used with Drain for better results than just naively splitting the logs on every whitespace. The tokenization here handles quote counting and emits quoted strings as a part of the same token. On the other side, it also handles likely JSON logs without any white spaces in them better, by trying to split {"key":value} pairs (without actually parsing the JSON).

All of this is done without using regular expressions and without actually parsing the log lines in any specific format. That's why it works very efficiently in terms of CPU usage and alloctions, and should handle all log formats and unformatted logs equally well.

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
Title matches the required conventional commits format, see here
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

This code allows us to preprocess generic logs and replace highly variable dynamic data (timestamps, IPs, numbers, UUIDs, hex values, bytesizes and durations) with static placeholders for easier pattern extraction and more efficient and user-friendly matching by the Drain algorithm. Additionally, there is logic that splits generic log lines into discrete tokens that can be used with Drain for better results than just naively splitting the logs on every whitespace. The tokenization here handles quote counting and emits quoted strings as a part of the same token. On the other side, it also handles likely JSON logs without any white spaces in them better, by trying to split `{"key":value}` pairs (without actually parsing the JSON). All of this is done without using regular expressions and without actually parsing the log lines in any specific format. That's why it works very efficiently in terms of CPU usage and alloctions, and should handle all log formats and unformatted logs equally well.

benclive

Tough to review but as far as I can tell you've already covered the vast majority of test cases and it seems to work well. I have some concerns about maintainability but I could follow along & it is well commented so I think it's as good as it can be.
The rest of my comments are relatively minor so I'm happy to approve this. Nice job, very impressive implementation!