Skip to content

Commit

Permalink
feat: Generic logline placeholder replacement and tokenization (#12799)
Browse files Browse the repository at this point in the history
This code allows us to preprocess generic logs and replace highly variable dynamic data (timestamps, IPs, numbers, UUIDs, hex values, bytesizes and durations) with static placeholders for easier pattern extraction and more efficient and user-friendly matching by the Drain algorithm.

Additionally, there is logic that splits generic log lines into discrete tokens that can be used with Drain for better results than just naively splitting the logs on every whitespace. The tokenization here handles quote counting and emits quoted strings as a part of the same token. On the other side, it also handles likely JSON logs without any white spaces in them better, by trying to split `{"key":value}` pairs (without actually parsing the JSON).

All of this is done without using regular expressions and without actually parsing the log lines in any specific format. That's why it works very efficiently in terms of CPU usage and allocations, and should handle all log formats and unformatted logs equally well.
  • Loading branch information
na-- committed Apr 26, 2024
1 parent 151d0a5 commit 4047902
Show file tree
Hide file tree
Showing 4 changed files with 1,462 additions and 0 deletions.
Loading

0 comments on commit 4047902

Please sign in to comment.