
matklad/window


Window --- constant-time live grep

I can't grep, so I wrote this tool to help me out with the following problem:

I have a large (10s of gigabytes) log of a deterministic simulation of a distributed system. The log captures some failure (the cluster fails to converge or some such nastiness). Among the millions of messages which capture everything that happened to the cluster, I want to quickly identify the small subset that explains the failure.

In theory, this is solvable on the command line using the right combination of grep, tail, head and such, but I am not comfortable enough with coreutils to actually use these tools efficiently.

Enter window:

Screencast_20240125_113014-1.webm

On the left, there's a file with a filtering query. On the right --- the result of applying the filter to the entire log file. Changing the query automatically updates results.
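
One way to get this kind of live update is to poll the query file and re-run the filter whenever its text changes. The sketch below is an assumption for illustration only: this README does not say how window detects changes, and `QueryWatcher` and `poll` are hypothetical names.

```python
from pathlib import Path


class QueryWatcher:
    """Report a query file's text whenever it differs from the last poll."""

    def __init__(self, path):
        self.path = Path(path)
        self.last = None  # text seen on the previous poll, None before the first

    def poll(self):
        """Return the new query text if it changed since the last poll, else None."""
        text = self.path.read_text()
        if text == self.last:
            return None
        self.last = text
        return text
```

A driver loop would call `poll()` a few times per second and re-apply the filter only when it returns a string. The query file is tiny, so re-reading it on every poll is cheap.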

Secret sauce: window looks only at a constant amount of bytes from the input, and the size of the output is also bounded by a constant:

source_bytes_max = 104857600
target_bytes_max = 102400

That means:

  • It's on the user to position this 100MiB "window" above the interesting part of the log file
  • In exchange, processing is constant time --- a query with low specificity returns an empty result quickly, instead of grinding the computer to a halt.
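
The two limits can be sketched as a small filter function. This is a minimal illustration, not window's actual implementation: the function name, the `offset` parameter, and the substring match are assumptions, and only the two constants come from this README.

```python
import io

SOURCE_BYTES_MAX = 104857600  # 100 MiB of input, value from this README
TARGET_BYTES_MAX = 102400     # 100 KiB of output, value from this README


def window_filter(source, offset, needle,
                  source_bytes_max=SOURCE_BYTES_MAX,
                  target_bytes_max=TARGET_BYTES_MAX):
    """Return the lines containing `needle` from a bounded window of `source`.

    `source` is a seekable binary file object and `offset` positions the
    window. Lines cut by the window edges are treated as ordinary lines.
    """
    source.seek(offset)
    window = source.read(source_bytes_max)  # input bound: read at most this much
    out = []
    out_bytes = 0
    for line in window.splitlines(keepends=True):
        if needle in line:
            if out_bytes + len(line) > target_bytes_max:
                break  # output bound: stop rather than keep accumulating
            out.append(line)
            out_bytes += len(line)
    return b"".join(out)


# Tiny in-memory stand-in for a log file.
demo = io.BytesIO(b"ok 1\nerr 2\nok 3\nerr 4\n")
print(window_filter(demo, 0, b"err"))
```

Because both the bytes read and the bytes emitted are capped, the work per query does not grow with the size of the log. It depends only on where the user places `offset`.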

Usage Instructions

See the source code: at the moment, this is a tool for my personal use, so I haven't invested any time into making it user friendly, beyond writing this readme.

Design Notes

window combines insights from rust-analyzer and TigerBeetle:

  • target_bytes_max comes from rust-analyzer --- the end consumer of any tool's output is a human. Humans are notoriously slow and are incapable of processing more than a hundred lines at a time. So, any processing should be hard-capped to produce at most a screenful of output. If that's not enough, the correct solution is for the user to refine the query, rather than for the computer to bury the human in a needle-laced haystack.

  • source_bytes_max comes from TigerBeetle --- although the default programming model is to treat all resources as virtually unlimited (the only enforced limit is the 64-bit size of the address space), any real software running on real hardware will have some limits on the size of the dataset it can process. By starting from explicit limits, the software can become qualitatively more reliable, and usually much snappier as well!

See https://matklad.github.io/2024/02/10/window-live-constant-time-grep.html for an expanded discussion.
