sluice

A Rust parser and CLI for the Maven Central Nexus binary index format. Runs without a JVM, streams through the full index (~2.8 GB compressed, ~97M records) in a few minutes, and emits JSON Lines.

For a byte-level specification of the wire format, see docs/binary-format.md. The incremental-update protocol is covered in docs/incremental-updates.md.

Layout

The repo is a Cargo workspace with two crates. crates/core is the library, published as sluice-rs. It's I/O-neutral: it operates on any std::io::Read and has no knowledge of gzip, HTTP, files, or JSON. It parses the Nexus binary header and record stream, decodes fields (including CESU-8 strings), and classifies documents into descriptors, group lists, and artifact add/remove records with parsed UINFO tuples. crates/cli builds the sluice binary, which wraps the library with gzip decoding, argument parsing, and JSON Lines output.
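CESU-8 differs from standard UTF-8 only for characters outside the Basic Multilingual Plane: each one is stored as a UTF-16 surrogate pair, and each surrogate is encoded as its own 3-byte sequence. The following is a minimal sketch of one way to decode it with only the standard library. `decode_cesu8` is a hypothetical helper written for illustration, not the function sluice-rs exposes.

```rust
/// Decode CESU-8 bytes into a `String` (illustrative helper, not the crate's API).
///
/// Every 1-, 2-, or 3-byte sequence becomes one UTF-16 code unit, so a
/// supplementary character arrives as a surrogate pair and
/// `String::from_utf16` recombines it. Returns `None` on truncated or
/// malformed input, including unpaired surrogates.
pub fn decode_cesu8(bytes: &[u8]) -> Option<String> {
    let mut units: Vec<u16> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        if b0 & 0x80 == 0 {
            // 0xxxxxxx: plain ASCII
            units.push(u16::from(b0));
            i += 1;
        } else if b0 & 0xE0 == 0xC0 {
            // 110xxxxx 10xxxxxx
            let b1 = *bytes.get(i + 1)?;
            if b1 & 0xC0 != 0x80 {
                return None;
            }
            units.push((u16::from(b0 & 0x1F) << 6) | u16::from(b1 & 0x3F));
            i += 2;
        } else if b0 & 0xF0 == 0xE0 {
            // 1110xxxx 10xxxxxx 10xxxxxx
            let b1 = *bytes.get(i + 1)?;
            let b2 = *bytes.get(i + 2)?;
            if b1 & 0xC0 != 0x80 || b2 & 0xC0 != 0x80 {
                return None;
            }
            units.push(
                (u16::from(b0 & 0x0F) << 12)
                    | (u16::from(b1 & 0x3F) << 6)
                    | u16::from(b2 & 0x3F),
            );
            i += 3;
        } else {
            // 4-byte UTF-8 lead bytes and stray continuation bytes
            // do not occur in CESU-8.
            return None;
        }
    }
    String::from_utf16(&units).ok()
}
```

For example, the six bytes ED A0 BD ED B8 80 decode to U+1F600, while the 4-byte UTF-8 form of the same character is rejected. This sketch is lenient about overlong forms (it accepts C0 80 as NUL); whether the library behaves the same way is not stated here.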

Installation

Homebrew (macOS and Linux)

brew install overengineered-dev/tap/sluice

Cargo

cargo install sluice-cli

Prebuilt archives

Download the archive for your platform from the latest release, extract, and move sluice onto your PATH.

From source

git clone https://github.com/overengineered-dev/sluice
cd sluice
cargo install --path crates/cli

Quick start

# Parse a gzipped Maven Central index chunk and print artifact adds as
# JSON Lines (with stats on stderr).
sluice --stats chunk-latest.gz

Or stream the full index straight from Maven Central without saving it to disk (~2.8 GB compressed, several minutes to parse):

curl -sL https://repo1.maven.org/maven2/.index/nexus-maven-repository-index.gz \
  | sluice --stats > artifacts.jsonl

Contributors working from a clone can use the just recipes — see Development below.

CLI options

sluice [OPTIONS] [INPUT]

INPUT — path to a gzipped Maven index file. Reads from stdin if omitted.
--include-removes — also emit ArtifactRemove records (type="remove") alongside adds.
--full — emit all records including classified artifacts (sources, javadoc, etc.) with their classifier and extension. By default, only root-level artifacts (classifier=NA) are emitted.
--stats — print summary stats to stderr at end of run.
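The "file or stdin" behaviour of INPUT is a common CLI pattern. A minimal sketch of how it can be expressed in Rust follows; `open_input` is a hypothetical name, and this is not taken from the sluice-cli source.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Return a reader for `path`, or for stdin when no path was given.
/// Both branches are boxed so the caller sees a single `Read` type.
pub fn open_input(path: Option<&Path>) -> io::Result<Box<dyn Read>> {
    match path {
        Some(p) => Ok(Box::new(BufReader::new(File::open(p)?))),
        None => Ok(Box::new(io::stdin().lock())),
    }
}
```

The returned reader still carries gzip-compressed bytes; a decompressor would wrap it before any parsing.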

Output is one JSON object per line, e.g.:

{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","extension":"jar"}

With --full, classified artifacts are included and the classifier field appears:

{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","extension":"jar"}
{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","classifier":"sources","extension":"jar"}
{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","classifier":"javadoc","extension":"jar"}
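The output shape above can be reproduced with a few lines of standard-library Rust. This is a sketch under assumptions: `ArtifactLine`, `escape_json`, and `to_json_line` are hypothetical names, and the real CLI may serialize through serde instead. The only behaviour taken from the samples is the field order and the fact that `classifier` is omitted when absent.

```rust
/// Hypothetical record shape mirroring the fields in the sample output.
pub struct ArtifactLine<'a> {
    pub kind: &'a str, // "add" or "remove"
    pub group_id: &'a str,
    pub artifact_id: &'a str,
    pub version: &'a str,
    pub classifier: Option<&'a str>,
    pub extension: &'a str,
}

/// Escape a value for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one record as a single JSON Lines entry (no trailing newline).
pub fn to_json_line(r: &ArtifactLine<'_>) -> String {
    let mut line = format!(
        r#"{{"type":"{}","group_id":"{}","artifact_id":"{}","version":"{}""#,
        escape_json(r.kind),
        escape_json(r.group_id),
        escape_json(r.artifact_id),
        escape_json(r.version),
    );
    // The classifier key only appears for classified artifacts.
    if let Some(classifier) = r.classifier {
        line.push_str(&format!(r#","classifier":"{}""#, escape_json(classifier)));
    }
    line.push_str(&format!(r#","extension":"{}"}}"#, escape_json(r.extension)));
    line
}
```

Building an `ArtifactLine` with `classifier: Some("sources")` yields the second sample line; with `classifier: None` it yields the first.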

By default, records whose classifier is anything other than NA are filtered out. Use --full to include all records.
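To make the default filter concrete, here is a small sketch of parsing a UINFO value and deciding whether to keep it. It assumes the raw UINFO string is `groupId|artifactId|version|classifier|extension` with `NA` standing in for "no classifier", as in Apache Maven Indexer; check docs/binary-format.md for the authoritative layout. `Coordinates`, `parse_uinfo`, and `keep` are hypothetical names, not the sluice-rs API.

```rust
use std::fmt;

/// Illustrative stand-in for a parsed UINFO tuple.
#[derive(Debug, PartialEq, Eq)]
pub struct Coordinates {
    pub group_id: String,
    pub artifact_id: String,
    pub version: String,
    pub classifier: Option<String>,
    pub extension: Option<String>,
}

/// Split a `|`-separated UINFO string (assumed layout, see lead-in).
/// Returns `None` when fewer than three fields are present.
pub fn parse_uinfo(raw: &str) -> Option<Coordinates> {
    let mut parts = raw.split('|');
    let group_id = parts.next()?.to_string();
    let artifact_id = parts.next()?.to_string();
    let version = parts.next()?.to_string();
    // "NA" marks a root-level artifact, i.e. no classifier.
    let classifier = parts.next().filter(|c| *c != "NA").map(str::to_string);
    let extension = parts.next().map(str::to_string);
    Some(Coordinates { group_id, artifact_id, version, classifier, extension })
}

/// Default behaviour keeps only root-level artifacts; `full` keeps everything.
pub fn keep(c: &Coordinates, full: bool) -> bool {
    full || c.classifier.is_none()
}

impl fmt::Display for Coordinates {
    /// Formats as `groupId:artifactId:version[:classifier][:extension]`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}:{}:{}", self.group_id, self.artifact_id, self.version)?;
        if let Some(classifier) = &self.classifier {
            write!(f, ":{classifier}")?;
        }
        if let Some(extension) = &self.extension {
            write!(f, ":{extension}")?;
        }
        Ok(())
    }
}
```

With this sketch, `org.example|lib|1.0|NA|jar` is kept by default, while `org.example|lib|1.0|sources|jar` is kept only when `full` is true.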

Library usage

The core library reads from any std::io::Read. For gzipped index files, bring your own decompressor — flate2 works. The crate is published as sluice-rs on crates.io; the import path is sluice:

[dependencies]
sluice-rs = "0.1"
flate2 = "1"

use std::error::Error;
use std::fs::File;
use std::io::BufReader;
use flate2::read::GzDecoder;
use sluice::{IndexReader, Record};

fn main() -> Result<(), Box<dyn Error>> {
    // Decompress on the fly; the library itself only sees a plain `Read`.
    let file = File::open("fixtures/chunk-latest.gz")?;
    let gz = GzDecoder::new(BufReader::new(file));
    let index = IndexReader::new(BufReader::new(gz))?;

    for doc in index {
        let doc = doc?;
        // `Uinfo` implements `Display` as `groupId:artifactId:version[:classifier][:extension]`.
        match Record::try_from(&doc)? {
            Record::ArtifactAdd(u) => println!("add {u}"),
            Record::ArtifactRemove(u) => println!("del {u}"),
            // `Record` is `#[non_exhaustive]`; match `_` for descriptors, group lists,
            // and any future variants.
            _ => {}
        }
    }
    Ok(())
}
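Accepting any std::io::Read means a parser can be exercised against in-memory bytes as easily as against a decompressed file. The sketch below shows the general pattern with a toy length-prefixed field (a big-endian u16 length followed by that many bytes). The layout and the name `read_len_prefixed` are invented for illustration and are not the Nexus wire format or part of sluice-rs.

```rust
use std::io::{self, Read};

/// Read one toy field: a big-endian `u16` length, then that many bytes.
/// Works with any `Read`: a file, a gzip decoder, or an in-memory slice.
pub fn read_len_prefixed<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 2];
    reader.read_exact(&mut len_buf)?;
    let len = usize::from(u16::from_be_bytes(len_buf));

    let mut payload = vec![0u8; len];
    // `read_exact` fails with `UnexpectedEof` if the stream is truncated.
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

In a test, a `&[u8]` slice or a `std::io::Cursor` can stand in for the decompressed stream, so no fixture file or gzip dependency is needed.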

Serde support

Enable the serde feature to derive Serialize on all domain types (Record, Uinfo, Document, etc.):

[dependencies]
sluice-rs = { version = "0.1", features = ["serde"] }
serde_json = "1"

use sluice::{IndexReader, Record};

// ... set up IndexReader as above ...

for doc in index {
    let doc = doc?;
    if let Record::ArtifactAdd(ref uinfo) = Record::try_from(&doc)? {
        println!("{}", serde_json::to_string(uinfo)?);
    }
}

Performance

On the full Maven Central index (2.8 GB compressed, ~97M documents), sluice takes about 208 seconds end-to-end. The Java indexer-reader from Apache Maven Indexer takes about 1112 seconds on the same input.

Tool	Mean	Relative
sluice (Rust)	208s	1.00
indexer-reader (Java)	1112s	5.35

These numbers aren't directly comparable: the Java tool does additional per-record work (field expansion via RecordExpander) that sluice doesn't, so some of the gap is workload, not implementation. The two tools produce matching output across all ~97M records. Methodology and reproduction steps are in docs/benchmark.md.
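As a quick sanity check on the table, the relative figure and the implied throughput follow directly from the two means. The helper names below are hypothetical, and the record count is the approximate ~97M quoted above, so the per-second figures are ballpark values only.

```rust
/// How many times longer the slower run took.
pub fn relative(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}

/// Average records processed per second over a whole run.
pub fn records_per_second(records: f64, secs: f64) -> f64 {
    records / secs
}

/// Round to two decimal places, matching the precision used in the table.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

With the means above, `round2(relative(1112.0, 208.0))` gives 5.35, and 97 million records in 208 seconds works out to roughly 466,000 records per second, against roughly 87,000 for the 1112-second run.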

Development

Recipes are run through just (cargo install just or brew install just):

just fmt         # cargo fmt --all
just fmt-check   # cargo fmt --all -- --check
just lint        # cargo clippy --all-targets --all-features -- -D warnings
just test        # cargo test --all
just fetch-chunk # download the latest incremental chunk into fixtures/
just run-chunk   # parse fixtures/chunk-latest.gz with --stats
just fetch-full  # download the full Maven Central index (~2.8 GB)
just run-full    # parse the full index

The Rust toolchain is pinned via rust-toolchain.toml. MSRV is 1.75 for the library (sluice-rs) and 1.85 for the CLI (sluice-cli), because clap's dependency tree requires the 2024 edition. Lints are configured workspace-wide: rust_2018_idioms is denied and clippy::pedantic is set to warn.

Test fixtures

A small sample fixture (crates/core/tests/fixtures/chunk-sample.gz) is committed for offline testing. To regenerate it from a full Maven Central chunk:

just fetch-chunk
just regen-fixture

The full chunk is not committed, which keeps clone sizes small.

License

Apache-2.0.
