How can we filter .csv.gz files?

This was a fun little team project to find the fastest way to filter S3 inventory .csv.gz files!

Other implementations:

Usage

# Get some working data: downloads 1 GB from S3 into the testdata/ subdirectory
> ./download.sh


# Processing one file at a time
> go run ./filter.go


# Processing in parallel (workers = num cpus)
> GOPAR=1 go run ./filter.go

My results (on a late 2017 13" MacBook Pro)

Strategy: One file at a time ...
Total: 31521045, Matched: 710093, Ratio: 2.25%
Time: 52.740166887s
Strategy: Parallel, 4 Workers ...
Total: 31521045, Matched: 710093, Ratio: 2.25%
Time: 27.207802611s