COST is an acronym for the "Configuration that Outperforms a Single Thread", indicating the hardware resources required by a distributed system before it begins to outperform a single-threaded implementation. This repository contains the single-threaded implementations providing the baseline performance.
Specifically, this repository contains single-threaded implementations of three graph algorithms (PageRank, label propagation, and union-find), supporting performance measurements taken on two graphs, `twitter_rv` and `uk_2007_05`. The code is intended to be instructive, rather than a meaningful replacement for a graph-processing system.
Once cloned, the project can be built and run by typing

```
cargo run --release
```

which should result in usage information indicating the appropriate arguments:
```
     Running `target/COST`
Invalid arguments.

Usage: COST pagerank   (vertex | hilbert | compressed) <prefix>
       COST label_prop (vertex | hilbert | compressed) <prefix>
       COST union_find (vertex | hilbert | compressed) <prefix>
       COST stats      (vertex | hilbert | compressed) <prefix>
       COST print      (vertex | hilbert | compressed) <prefix>
       COST to_hilbert [--dense] <prefix>
       COST parse_to_hilbert
       COST merge <source>...
       COST twitter <from> <prefix>
```
The first three modes correspond to the three graph algorithms. The second parameter indicates the binary graph layout. The final parameter must be a path prefix for which files with certain extensions exist, discussed in a moment. The `to_hilbert` mode re-lays out a graph along a Hilbert space-filling curve, optionally densifying the identifiers.
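For intuition, a Hilbert curve visits the cells of a power-of-two grid so that consecutive positions along the curve are always adjacent cells, which is what gives the edge layout its locality. Below is a sketch of the standard bit-twiddling mapping from a cell `(x, y)` to its curve position; the function names are illustrative, and the curve's orientation conventions may differ from what `to_hilbert` actually uses.

```rust
/// Rotate/flip a quadrant so the sub-curves nest correctly.
fn rot(n: u32, x: &mut u32, y: &mut u32, rx: u32, ry: u32) {
    if ry == 0 {
        if rx == 1 {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        std::mem::swap(x, y);
    }
}

/// Map cell (x, y) of an n-by-n grid (n a power of two) to its
/// position along the Hilbert curve.
fn xy2d(n: u32, mut x: u32, mut y: u32) -> u64 {
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = if (x & s) > 0 { 1 } else { 0 };
        let ry = if (y & s) > 0 { 1 } else { 0 };
        d += (s as u64) * (s as u64) * ((3 * rx) ^ ry) as u64;
        rot(n, &mut x, &mut y, rx, ry);
        s /= 2;
    }
    d
}

fn main() {
    // Collect all cells of a 4x4 grid in curve order and check that
    // each step along the curve moves to an adjacent cell.
    let n = 4;
    let mut cells: Vec<(u64, u32, u32)> = Vec::new();
    for x in 0..n {
        for y in 0..n {
            cells.push((xy2d(n, x, y), x, y));
        }
    }
    cells.sort();
    for w in cells.windows(2) {
        let step = w[0].1.abs_diff(w[1].1) + w[0].2.abs_diff(w[1].2);
        assert_eq!(step, 1); // locality: each step is a unit move
    }
    println!("curve order: {:?}", cells.iter().map(|c| (c.1, c.2)).collect::<Vec<_>>());
}
```

Laying out edges `(src, dst)` in this order means that nearby edges in the file tend to share nearby sources and destinations, which improves cache and memory locality during the scan.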
The computation will not do anything productive without graph data, and the graph data used for the experiments (processed versions of the graphs linked above) are too large to host here. I'm also not wild about distributing programs that write data to someone else's computer without more serious review. That being said, the file `src/twitter_parser.rs` contains the code that I used to parse `twitter_rv.net`, the file you get from the link above.
You can find the twitter files by searching for "twitter_rv.tar.gz" (6GB), "twitter_rv_19000000.tar.gz" (764MB), or "twitter_rv_small.tar.gz" (188MB). The sizes listed are the compressed sizes.
To process the small twitter dataset into a pair of `small.nodes` and `small.edges` files, you can run:

```
./target/release/COST twitter twitter_rv_15066953.net small
```
To convert the `small.nodes` and `small.edges` files to the Hilbert layout, you can run:

```
./target/release/COST to_hilbert small
```

which will create two files, `small.upper` and `small.lower`.
To compute PageRank, you can run:

```
./target/release/COST pagerank hilbert small
```
The required graph layout is quite simple (as is the code to parse it), and you should be able to write out your own graph data if you would like to try out the code.
The `vertex` option requires a `<prefix>` for which files `<prefix>.nodes` and `<prefix>.edges` exist. The two files should be binary: `<prefix>.nodes` contains a sequence of `(u32, u32)` pairs representing `(node_id, degree)`, and `<prefix>.edges` contains a sequence of `u32` values giving the concatenation of edge destinations for all nodes indicated above. The program is just going to map these two files into memory and read them, so you want to make sure the data have the appropriate endianness for your system.
The `hilbert` option requires a `<prefix>` for which `<prefix>.upper` and `<prefix>.lower` exist. The two files should be binary: `<prefix>.upper` contains a sequence of `((u16, u16), u32)` values, each indicating a pair of upper 16 bits of node identifiers and a count of how many edges have that pair of upper 16 bits, and `<prefix>.lower` contains a concatenated sequence of `(u16, u16)` values giving the lower 16 bits for each edge in each group above.
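To make the upper/lower split concrete, here is a sketch of how edges decompose into `.upper` group records and `.lower` entries. The edge values are invented for illustration, and the real `to_hilbert` additionally orders groups and edges along the Hilbert curve; this sketch shows only the bit split and grouping.

```rust
use std::collections::BTreeMap;

fn main() {
    // Illustrative edges as (source, destination) u32 pairs.
    let edges: [(u32, u32); 3] = [
        (0x0001_0002, 0x0001_0003),
        (0x0001_0004, 0x0001_0005),
        (0x0002_0006, 0x0003_0007),
    ];

    // Group the lower 16-bit halves under their shared upper 16-bit pair.
    let mut groups: BTreeMap<(u16, u16), Vec<(u16, u16)>> = BTreeMap::new();
    for (src, dst) in edges {
        let upper = ((src >> 16) as u16, (dst >> 16) as u16);
        let lower = ((src & 0xFFFF) as u16, (dst & 0xFFFF) as u16);
        groups.entry(upper).or_default().push(lower);
    }

    // .upper would hold one ((u16, u16), u32) record per group, and
    // .lower the concatenation of each group's (u16, u16) entries.
    for (upper, lowers) in &groups {
        println!("upper {:?} -> count {}: lowers {:?}", upper, lowers.len(), lowers);
    }
}
```

Here the first two edges share the upper pair `(1, 1)`, so they form one group of count 2, while the third edge forms its own group of count 1.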
The easiest way to get a feel for what these files should look like is to invoke the `to_hilbert` option with a `<prefix>` valid for data in the `vertex` layout; it will print to the screen what the data look like laid out in the Hilbert format. If you change the code to write the data to disk rather than to the terminal, you should be good to go (remember: `.upper` and `.lower`, not `.nodes` and `.edges`).
There is a companion COST repository managed by Microsoft Research, containing the state of the project as of several months ago. This may be helpful if you are interested in the corresponding C# implementations; the repository also contains Naiad implementations that were done more recently. I am no longer affiliated with Microsoft and cannot commit to the repository (nor, historically, do they accept pull requests), and I must apologize for the sorry state I left the code in. It may be cleaned up in the future (by me, or by other more industrious souls), given the right incentives.