One Billion Row Challenge (1brc) using Crystal

Motivation

While Gunnar Morling originally posed The One Billion Row Challenge for Java developers, over time folks have started implementing it in different languages which are showcased here.

While reviewing the results in Java I ran across one implementation in C that made me wonder what I could do with Crystal.

See discussion in Crystal forum here. This led me to create TODO list of things to try.

See post in Gunnar's "Show & Tell" discussion area here

Overall performance (so far)

Updated 2024 Feb 11

On a PC with AMD Ryzen 7 7735HS CPU, 16 cores, 32 GB, and running Linux, comparing with some other 1brc contenders there's still room to do. See the TODO list for changes so far and possible further improvements. See this CSV for the detailed results.

relative	command	Lang	mean (s)	stddev
1.0x	serkan-ozal	Java	1.1661091954200002	0.10359078275160513
	dannyvankooten/analyze	C	1.24137285062	0.003261490844545827
	merykittyunsafe.sh	Java	1.34270993842	0.036888174169103186
	merykitty.sh	Java	1.51770504842	0.021277280568716347
1.74x slower	1brc_parallel_ptr4 16 48	Crystal	2.03081785402	0.06009663760371807
1.77x slower	1brc_parallel_mmap2b 16 32	Crystal	2.06915509142	0.01875921473395469

It's worth noting that there is no significant different between my two best results; given the way the file is read in parallel and in chunks, using mmap or not doesn't make a big difference.

Dependencies

None. Plain ol' Crystal.

Build all

... with `shards`

shards build --release -Dpreview_mt

... with `ops`

Use if you have brew on macOS or Linux.

First install crops and then

ops up
ops cbr -Dpreview_mt

Run

Status	Implementation	Description	Performance
Fastest	`1brc_parallel_ptr4`	Variant of `ptr3b` using long word name parsing	slightly faster
	`1brc_parallel_ptr3b`	Variant of `ptr3` using optimized hash map	much faster
	`1brc_parallel_ptr3`	Variant of `ptr2` which only launched N fibres and loops over D / N chunks.	faster
	`1brc_parallel_ptr2b`	Variant of `ptr2` using optimized hash map	much faster
	`1brc_parallel_ptr2`	Variant using page-aligned part size in anticipation of `mmap`-based implementation.	faster again
	`1brc_parallel_ptr`	Replaces `Slice` with `Pointer` to the buffer, to remove bounds checking when parsing.	faster
Fastest	`1brc_parallel_mmap3`	Variant of `mmap2b` using long word name parsing	slightly faster
	`1brc_parallel_mmap2b`	Variant of `mmap2` using optimized hash map	much faster
	`1brc_parallel_mmap2`	Variant of `mmap` which only launched N fibres and loops over D / N chunks.	slightly faster but not always
	`1brc_parallel_mmap`	Using `mmap` to load the file	faster
	`1brc_parallel2`	Variant using page-aligned part size in anticipation of `mmap`-based implementation.	faster
	`1brc_parallel`	Parallel multi-threaded implementation that chunks up the file and spawns fibres to process them	much faster
	`1brc_serial2`	Serial implementation optimized to use byte slices (`Bytes`)	a little faster, but still slow.
Slowest	`1brc_serial1`	A very simple serial implementation using `String` lines	slowest.

While you are welcome to run the serial implementatios, my focus from now on will on the parallel implementations.

The parallel implementations,

given the buffer division is D (via BUF_DIV_DENOM), and
given the number of threads is N (via CRYSTAL_WORKERS), and
given N < q where q is the number of chunks based on file_size / (Int32::MAX / D)

works as follows:

spawns q fibres, and
allocates N buffers, and
processes N chunks concurrently.

Note that since D is the denominator, we have the following number of chunks (and their sizes) based on D.

D	q chunks	chunk size
4	26	512 MB
5	33	410 MB
6	39	341 MB
7	46	293 MB
8	52	256 MB
12	78	171 MB
16	104	128 MB
24	156	85 MB
32	208	64 MB
48	312	43 MB

A script run.sh is provided to conveniently run one of the implementations and specify the concurrency values.

Make sure you have measurements.txt in the current folder, and then execute ./run.sh 1brc_parallel 32 24 to run the implementation with the specific threads (32) and buffer division (24).

If your machine is different from mine (see results below), send me your results.

Results

See the various file in /perfdata for results and analyses over time.

Contributing

Bug reports and sugestions are welcome.

This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.

License

This project is available as open source under the terms of the CC-BY-4.0 license.

Contributors

nogginly - creator and maintainer

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
perfdata		perfdata
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
ops.yml		ops.yml
run.sh		run.sh
shard.yml		shard.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perfdata

perfdata

src

src

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

TODO.md

TODO.md

ops.yml

ops.yml

run.sh

run.sh

shard.yml

shard.yml

Repository files navigation

One Billion Row Challenge (1brc) using Crystal

Motivation

Overall performance (so far)

Dependencies

Build all

... with `shards`

... with `ops`

Run

Results

Contributing

License

Contributors

About

Releases

Packages

Languages

License

nogginly/1brc.cr

Folders and files

Latest commit

History

Repository files navigation

One Billion Row Challenge (1brc) using Crystal

Motivation

Overall performance (so far)

Dependencies

Build all

... with shards

... with ops

Run

Results

Contributing

License

Contributors

About

Resources

License

Stars

Watchers

Forks

Languages

... with `shards`

... with `ops`