miku/filterline

filterline filters a file by line numbers.

Taken from here. There's an awk version, too.

Installation

There are deb and rpm packages.

To build from source:

$ git clone https://github.com/miku/filterline.git
$ cd filterline
$ make

Usage

Note that the line number file (L) must be sorted in ascending order and must not contain duplicates.
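
If a list of line numbers is unsorted or contains duplicates, it can be normalized first, for example with GNU sort (raw.L is just a placeholder file name):

$ sort -n -u raw.L > L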

$ filterline
Usage: filterline FILE1 FILE2

FILE1: line numbers, FILE2: input file

$ cat fixtures/L
1
2
5
6

$ cat fixtures/F
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10

$ filterline fixtures/L fixtures/F
line 1
line 2
line 5
line 6

$ filterline <(echo 1 2 5 6) fixtures/F
line 1
line 2
line 5
line 6

Since 0.1.4, there is a -v flag to "invert" the match, i.e. print only the lines whose numbers are not listed.

$ filterline -v <(echo 1 2 5 6) fixtures/F
line 3
line 4
line 7
line 8
line 9
line 10

Performance

Filtering 10 million lines out of a file with 1 billion lines (14G) takes about 33 seconds (dropped caches, i7-2620M):

$ time filterline 10000000.L 1000000000.F > /dev/null
real    0m33.434s
user    0m25.334s
sys     0m5.920s

A similar awk script takes about 2-3 times longer.
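
For comparison, a minimal awk sketch of the same idea (not necessarily the exact script used for this benchmark) reads the line numbers from the first file into an array and prints the matching lines of the second file:

$ awk 'NR == FNR { keep[$1]; next } FNR in keep' fixtures/L fixtures/F
line 1
line 2
line 5
line 6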

Use case: data compaction

One use case for such a filter is data compaction. Imagine that you harvest an API every day and you keep the JSON responses in a log.

What is a log?

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time.

From: The Log: What every software engineer should know about real-time data's unifying abstraction

For simplicity, let's think of the log as a file. So every time you harvest the API, you just append to that file:

$ cat harvest-2015-06-01.ldj >> log.ldj
$ cat harvest-2015-06-02.ldj >> log.ldj
...

The API responses can contain entries that are new as well as entries that represent updates to existing records. If you want to answer the question:

What is the current state of each record?

... you would have to find the most recent version of each record in that log file. A typical solution would be to switch from a file to a database of sorts and do some kind of upsert.

But what about logs with 100M, 500M or billions of records? And what if you do not want to run an extra component, like a database?

You can make this process a shell one-liner, and a reasonably fast one, too.
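
As a sketch, assuming each record carries a unique id field and jq plus GNU coreutils are available (snapshot.ldj is just a placeholder output name), the line numbers of the latest version of each record can be computed and passed to filterline:

$ jq -r '.id' log.ldj | nl -w1 -s' ' | tac | sort -u -k2,2 | cut -d' ' -f1 | sort -n > L
$ filterline L log.ldj > snapshot.ldj

Here nl numbers every id in log order, tac reverses the stream so the newest occurrence of each id comes first, sort -u -k2,2 keeps exactly one line per id (the newest), cut keeps only the line numbers, and the final sort -n puts them into the ascending order filterline expects.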

Data point: Crossref Snapshot

Crossref hosts a constantly evolving index of scholarly metadata, available via API. We use filterline to turn a sequence of hundreds of daily API updates into a single snapshot, via span-crossref-snapshot (more details):

$ filterline L <(zstd -dc -T0 data.ndj.zst) | zstd -c -T0 > snapshot.ndj.zst

             ^                  ^                             ^
             |                  |                             |
       lines to keep       ~1B+ records, 4T+             latest versions, ~140M records

Crunching through ~1B messages takes about 65 minutes, at roughly 1GB/s.

Look, ma, just files.