A streaming plain-text log analyzer for very large files (GB-scale), focused on metrics that help tune Fluent Bit parsing.
- Throughput: elapsed wall-clock time, MB/s, lines/s
- Line shape: min/avg/max length, percentiles (p50/p90/p95/p99), very-long-line counts
- Timestamp detection: presence, format breakdown (ISO 8601, Apache, epoch seconds/millis, syslog), earliest/latest timestamp range
- Severity tokens: `trace`/`debug`/`info`/`warn`/`error`/`fatal`
- JSON analysis: JSON-like line ratio and parse failures
- Parse-failure hints: lines matching parse/malformed/invalid/unmarshal patterns, with top recurring signatures
- Error hints: top recurring error signatures
- Multiline block analysis: consecutive non-timestamp lines — frequency, line/byte length stats, heuristic format detection (`json`, `yaml_like`, `xml_like`, `stacktrace_like`, `plain_text`), and one example block per format
- Duplicate line detection: top-K most repeated lines
- Sample lines: lines with parse-failure or orphaned error hints
- Custom regex counters via `--pattern`
- Colored terminal output (auto-detected, `--no-color` to disable)
- Fluent Bit configuration recommendations: optimal `Buffer_Size`, `Mem_Buf_Limit`, `Time_Format`, multiline parser settings, and advisory notes — all derived automatically from the analysis

```bash
cargo build --release
```

Binary: `target/release/log-analyzer`

Analyze one or more files:

```bash
./target/release/log-analyzer /path/to/app.log /path/to/worker.log
```

Gzipped logs (auto-detected by `.gz` extension):

```bash
./target/release/log-analyzer /path/to/app.log.gz
```

From stdin:

```bash
cat /path/to/app.log | ./target/release/log-analyzer -
```

Track specific patterns:

```bash
./target/release/log-analyzer app.log \
  --pattern 'cannot parse' \
  --pattern 'invalid time format' \
  --pattern 'multiline'
```

Export machine-readable JSON:

```bash
./target/release/log-analyzer app.log --json-out report.json
```

Control report limits:

```bash
./target/release/log-analyzer app.log \
  --sample-limit 50 \
  --top-k 30 \
  --max-signatures 50000 \
  --top-duplicates 20
```

Disable color or progress:

```bash
./target/release/log-analyzer app.log --no-color
./target/release/log-analyzer app.log --no-progress
./target/release/log-analyzer app.log --progress-interval 0.5
```

| Flag | Default | Description |
|---|---|---|
| `--json-out <FILE>` | — | Write JSON report to file |
| `--pattern <REGEX>` | — | Custom regex to count (repeatable) |
| `--sample-limit <N>` | 20 | Max sample lines to collect |
| `--top-k <N>` | 20 | Top-K signatures to show |
| `--max-signatures <N>` | 10000 | Max unique signatures to track |
| `--top-duplicates <N>` | 10 | Top-N duplicate lines to show |
| `--progress-interval <S>` | 1.0 | Progress update interval (seconds) |
| `--no-progress` | false | Suppress progress output on stderr |
| `--no-color` | false | Disable colored output |

- Processing is streaming and single-pass; memory is bounded via signature caps.
- Text decoding uses UTF-8 with replacement for invalid bytes.
- The tool does not modify input files.
- Progress is printed to `stderr` and includes ETA for regular files with known size.
- For `stdin` and `.gz` inputs, ETA is shown as `n/a` because total uncompressed size is not known up front.
- Local benchmark: ~1.98 GiB log file processed in ~5.5 s (~369 MiB/s, ~953 k lines/s).

Every report ends with a Fluent Bit Configuration Recommendations section that suggests concrete settings derived from the log analysis:

| Setting | How it is derived |
|---|---|
| `Buffer_Size` | Next power-of-two above the longest line + 1 KB framing overhead |
| `Mem_Buf_Limit` | Based on estimated real-world ingestion rate × 60 s headroom, clamped to [8 MB, 512 MB] |
| `storage.max_chunks_up` | `Mem_Buf_Limit` ÷ 2 MB chunk size, rounded to a power of two (≥ 128) |
| `Format` | `json` if >50% of lines are JSON-like, otherwise `regex` |
| `Time_Format` | Fluent Bit strftime string matching the dominant timestamp format |
| `flush_timeout` | Scaled from multiline block depth (1–5 s) |
| `parser_firstline` | Regex anchored to the dominant timestamp pattern |

Advisory notes flag issues such as mixed timestamp formats, very long lines, high multiline ratios, invalid UTF-8, and parse-failure patterns.

The same data is available in the `fluent_bit` key of the `--json-out` output.
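
The sizing rules from the table above, written out as an added Rust example (not the tool's own source code). It assumes MB and KB mean MiB and KiB, that rounding is upward, and that the caller supplies the ingestion-rate estimate; all function names are hypothetical.

```rust
// Added illustration, not the log-analyzer source code.
// Assumptions: MB/KB mean MiB/KiB, and "rounded" means rounded up.
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Buffer_Size: longest line + 1 KiB framing overhead, rounded up to a power of two.
fn buffer_size(longest_line: u64) -> u64 {
    longest_line
        .saturating_add(KIB)
        .checked_next_power_of_two()
        .unwrap_or(u64::MAX) // absurdly long line: saturate instead of overflowing
}

/// Mem_Buf_Limit: 60 s of the estimated ingestion rate, clamped to [8 MiB, 512 MiB].
fn mem_buf_limit(ingest_bytes_per_sec: u64) -> u64 {
    ingest_bytes_per_sec.saturating_mul(60).clamp(8 * MIB, 512 * MIB)
}

/// storage.max_chunks_up: Mem_Buf_Limit / 2 MiB chunks, power of two, at least 128.
fn max_chunks_up(limit_bytes: u64) -> u64 {
    (limit_bytes / (2 * MIB)).next_power_of_two().max(128)
}

/// Format: "json" only when strictly more than 50% of lines are JSON-like.
fn parser_format(json_like: u64, total: u64) -> &'static str {
    if total > 0 && (json_like as u128) * 2 > total as u128 { "json" } else { "regex" }
}

fn main() {
    // Constructed inputs: 3,000-byte longest line, 1 MiB/s ingestion, 51% JSON-like lines.
    let limit = mem_buf_limit(MIB);
    println!("Buffer_Size           {}", buffer_size(3_000));
    println!("Mem_Buf_Limit         {}", limit);
    println!("storage.max_chunks_up {}", max_chunks_up(limit));
    println!("Format                {}", parser_format(51, 100));
}
```
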