COST

COST is an acronym for the "Configuration that Outperforms a Single Thread", indicating the hardware resources required by a distributed system before it begins to outperform a single-threaded implementation. This repository contains the single-threaded implementations providing the baseline performance.

Specifically, this repository contains single-threaded implementations of three graph algorithms, PageRank, label propagation, and union-find, supporting performance measurements taken on two graphs, twitter_rv and uk_2007_05. The code is intended to be instructive, rather than a meaningful replacement for a graph-processing system.
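The flavor of such a baseline is easy to demonstrate. Below is a minimal, single-threaded PageRank sketch over an in-memory edge list; it is not this repository's implementation (which works over the memory-mapped layouts described below and is measured on far larger graphs), just an illustration of the kind of sequential computation being used as a baseline. Among other simplifications, it ignores dangling nodes.

// Minimal single-threaded PageRank over an edge list (illustrative sketch only).
fn pagerank(num_nodes: usize, edges: &[(u32, u32)], iterations: usize, alpha: f32) -> Vec<f32> {
    // out-degrees, needed to split each node's rank among its outgoing edges
    let mut degree = vec![0u32; num_nodes];
    for &(src, _) in edges {
        degree[src as usize] += 1;
    }

    let mut ranks = vec![1.0 / num_nodes as f32; num_nodes];
    for _ in 0..iterations {
        let mut next = vec![(1.0 - alpha) / num_nodes as f32; num_nodes];
        // one sequential pass over the edges per iteration; no threads, no partitioning
        for &(src, dst) in edges {
            next[dst as usize] += alpha * ranks[src as usize] / degree[src as usize] as f32;
        }
        ranks = next;
    }
    ranks
}

fn main() {
    let edges = vec![(0u32, 1u32), (1, 2), (2, 0), (2, 1)];
    println!("{:?}", pagerank(3, &edges, 20, 0.85));
}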

Instructions

Once cloned, the project can be built and run by typing

cargo run --release

which should result in usage information indicating appropriate arguments:

Running `target/COST`
Invalid arguments.

Usage: COST pagerank  (vertex | hilbert | compressed) <prefix>
       COST label_prop (vertex | hilbert | compressed) <prefix>
       COST union_find (vertex | hilbert | compressed) <prefix>
       COST stats  (vertex | hilbert | compressed) <prefix>
       COST print  (vertex | hilbert | compressed) <prefix>
       COST to_hilbert [--dense] <prefix>
       COST parse_to_hilbert
       COST merge <source>...
       COST twitter <from> <prefix>

The first three modes correspond to the three graph algorithms. The second parameter indicates the binary graph layout. The final parameter must be a path prefix for which files with certain extensions exist, discussed in a moment. The to_hilbert mode lays out the graph according to a Hilbert space-filling curve, optionally densifying the identifiers.

Graph input

The computation will not do anything productive without graph data, and the graph data used for the experiments (processed versions of the graphs linked above) are too large to host here. I'm also not wild about distributing programs that write data back to someone else's computer without more serious review. That being said, the file src/twitter_parser.rs contains the code I used to parse twitter_rv.net, the file you get from the link above.

You can find the twitter files by searching for "twitter_rv.tar.gz" (6GB), "twitter_rv_19000000.tar.gz" (764MB) or "twitter_rv_small.tar.gz" (188MB). The sizes listed are the compressed sizes.

To process the small twitter dataset into a set of small.nodes and small.edges files, you can run:

./target/release/COST twitter twitter_rv_15066953.net small

To convert the small.nodes and small.edges files to the Hilbert layout, you can run:

./target/release/COST to_hilbert small

This will create two files, small.upper and small.lower.

To compute pagerank, you can run:

./target/release/COST pagerank hilbert small

The required graph layout is quite simple (as is the code to parse it), and you should be able to write out your own graph data if you would like to try out the code.

The vertex option requires a <prefix> for which files <prefix>.nodes and <prefix>.edges exist. The two files should be binary, where

  • <prefix>.nodes contains a sequence of (u32, u32) pairs representing (node_id, degree).
  • <prefix>.edges contains a sequence of u32 values indicating the concatenation of edge destinations for all nodes indicated above.

The program simply maps these two files into memory and reads them, so make sure the data use the appropriate endianness for your system.
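For example, here is a minimal sketch (not code from this repository) that writes a tiny graph in this layout; the file names and sample data are made up, and values are written in native byte order so that a memory-mapped reader on the same machine sees them correctly.

// Write a toy graph in the vertex layout: "example.nodes" holds (node_id, degree)
// pairs of u32, and "example.edges" holds the concatenated u32 destinations.
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // hypothetical adjacency list: node_id -> destinations
    let graph: Vec<(u32, Vec<u32>)> = vec![
        (0, vec![1, 2]),
        (1, vec![2]),
        (2, vec![0]),
    ];

    let mut nodes = BufWriter::new(File::create("example.nodes")?);
    let mut edges = BufWriter::new(File::create("example.edges")?);

    for (node_id, dsts) in &graph {
        // the (node_id, degree) pair goes into the .nodes file
        nodes.write_all(&node_id.to_ne_bytes())?;
        nodes.write_all(&(dsts.len() as u32).to_ne_bytes())?;
        // that many destinations are appended to the .edges file
        for dst in dsts {
            edges.write_all(&dst.to_ne_bytes())?;
        }
    }
    Ok(())
}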

The hilbert option requires a <prefix> for which <prefix>.upper and <prefix>.lower exist. The two files should be binary, where

  • <prefix>.upper contains a sequence of ((u16, u16), u32) values, indicating a pair of upper 16 bits of node identifiers, and a count of how many edges have this pair of upper 16 bits.
  • <prefix>.lower contains a concatenated sequence of (u16, u16) values for the lower 16 bits for each edge in each group above.

The easiest way to get a feel for what these should look like is to invoke the to_hilbert option with a <prefix> valid for data in the vertex layout, and it will print to the screen what the data look like laid out in Hilbert format. If you change the code to write the data to disk rather than to the terminal, you should be good to go (remember, .upper and .lower, not .nodes and .edges).
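To make the grouping concrete, here is a minimal in-memory sketch (not code from this repository) of how a list of (src, dst) edges decomposes into the upper/lower representation described above. Note that it orders groups only by their upper bits; the real to_hilbert mode orders them along a Hilbert space-filling curve, which this sketch does not attempt.

// Split each edge into upper and lower 16-bit halves, grouping edges that share
// the same upper halves into ((upper pair), count) entries plus per-edge lower pairs.
fn to_upper_lower(mut edges: Vec<(u32, u32)>) -> (Vec<((u16, u16), u32)>, Vec<(u16, u16)>) {
    // bring edges whose src and dst share the same upper 16 bits next to each other
    edges.sort_by_key(|&(src, dst)| (src >> 16, dst >> 16));

    let mut upper: Vec<((u16, u16), u32)> = Vec::new();
    let mut lower: Vec<(u16, u16)> = Vec::new();

    for (src, dst) in edges {
        let hi = ((src >> 16) as u16, (dst >> 16) as u16);
        let lo = ((src & 0xFFFF) as u16, (dst & 0xFFFF) as u16);
        // extend the current group if this edge shares its upper bits, else start a new one
        let same_group = upper.last().map_or(false, |&(prev_hi, _)| prev_hi == hi);
        if same_group {
            upper.last_mut().unwrap().1 += 1;
        } else {
            upper.push((hi, 1));
        }
        lower.push(lo);
    }
    (upper, lower)
}

fn main() {
    let (upper, lower) = to_upper_lower(vec![(1, 2), (1, 3), (70_000, 5)]);
    println!("{:?}", upper); // [((0, 0), 2), ((1, 0), 1)]
    println!("{:?}", lower); // [(1, 2), (1, 3), (4464, 5)]
}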

Notes

There is a companion COST repository managed by Microsoft Research, which reflects the state of the project from several months ago. It may be helpful if you are interested in the corresponding C# implementations; the repository also contains Naiad implementations that were done more recently. I am no longer affiliated with Microsoft and cannot commit to the repository (nor, historically, do they accept pull requests), and must apologize for the sorry state I left the code in. It may be cleaned up in the future (either by me or by other more industrious souls), given the right incentives.
