sluice

A Rust parser and CLI for the Maven Central Nexus binary index format. Runs without a JVM, streams through the full index (~2.8 GB compressed, ~97M records) in about three and a half minutes in the benchmark under Performance, and emits JSON Lines.

For a byte-level specification of the wire format, see docs/binary-format.md. The incremental-update protocol is covered in docs/incremental-updates.md.

Layout

The repo is a Cargo workspace with two crates. crates/core is the library, published as sluice-rs. It's I/O-neutral: it operates on any std::io::Read and has no knowledge of gzip, HTTP, files, or JSON. It parses the Nexus binary header and record stream, decodes fields (including CESU-8 strings), and classifies documents into descriptors, group lists, and artifact add/remove records with parsed UINFO tuples. crates/cli builds the sluice binary, which wraps the library with gzip decoding, argument parsing, and JSON Lines output.
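
To make the CESU-8 point concrete, here is a minimal sketch of a decoder. It is not the code in crates/core; decode_cesu8 is a hypothetical helper written for illustration. CESU-8 encodes supplementary characters as a UTF-16 surrogate pair, with each surrogate written as its own 3-byte sequence, so the sketch rebuilds the UTF-16 code units and lets the standard library join the pairs. It does not reject overlong encodings, which means the two-byte form C0 80 decodes to NUL as in Java's modified UTF-8.

```rust
/// Hypothetical helper (not part of sluice): decode CESU-8 bytes into a String.
/// Returns None on truncated input, bad continuation bytes, 4-byte UTF-8
/// sequences (not allowed in CESU-8), or unpaired surrogates.
fn decode_cesu8(bytes: &[u8]) -> Option<String> {
    let mut units: Vec<u16> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        if b0 < 0x80 {
            // Single-byte ASCII.
            units.push(u16::from(b0));
            i += 1;
        } else if b0 & 0xE0 == 0xC0 {
            // Two-byte sequence: 110xxxxx 10xxxxxx.
            let b1 = *bytes.get(i + 1)?;
            if b1 & 0xC0 != 0x80 {
                return None;
            }
            units.push((u16::from(b0 & 0x1F) << 6) | u16::from(b1 & 0x3F));
            i += 2;
        } else if b0 & 0xF0 == 0xE0 {
            // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
            // Surrogate halves arrive in this form and are kept as code units.
            let b1 = *bytes.get(i + 1)?;
            let b2 = *bytes.get(i + 2)?;
            if b1 & 0xC0 != 0x80 || b2 & 0xC0 != 0x80 {
                return None;
            }
            units.push(
                (u16::from(b0 & 0x0F) << 12)
                    | (u16::from(b1 & 0x3F) << 6)
                    | u16::from(b2 & 0x3F),
            );
            i += 3;
        } else {
            // Stray continuation byte or a 4-byte lead byte.
            return None;
        }
    }
    // Joins surrogate pairs and fails on unpaired surrogates.
    String::from_utf16(&units).ok()
}
```

For example, the six bytes ED A0 BD ED B8 80 are the CESU-8 form of U+1F600, while the ordinary UTF-8 form F0 9F 98 80 is rejected by this sketch.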

Installation

Homebrew (macOS and Linux)

brew install overengineered-dev/tap/sluice

Cargo

cargo install sluice-cli

Prebuilt archives

Download the archive for your platform from the latest release, extract, and move sluice onto your PATH.

From source

git clone https://github.com/overengineered-dev/sluice
cd sluice
cargo install --path crates/cli

Quick start

# Parse a gzipped Maven Central index chunk and print artifact adds as
# JSON Lines (with stats on stderr).
sluice --stats chunk-latest.gz

Or stream the full Maven Central index straight from Maven Central without saving it to disk (~2.8 GB compressed, several minutes to parse):

curl -sL https://repo1.maven.org/maven2/.index/nexus-maven-repository-index.gz \
  | sluice --stats > artifacts.jsonl

Contributors working from a clone can use the just recipes — see Development below.

CLI options

sluice [OPTIONS] [INPUT]
  • INPUT — path to a gzipped Maven index file. Reads from stdin if omitted.
  • --include-removes — also emit ArtifactRemove records (type="remove") alongside adds.
  • --full — emit all records including classified artifacts (sources, javadoc, etc.) with their classifier and extension. By default, only root-level artifacts (classifier=NA) are emitted.
  • --stats — print summary stats to stderr at end of run.

Output is one JSON object per line, e.g.:

{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","extension":"jar"}

With --full, classified artifacts are included and the classifier field appears:

{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","extension":"jar"}
{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","classifier":"sources","extension":"jar"}
{"type":"add","group_id":"org.example","artifact_id":"lib","version":"1.0","classifier":"javadoc","extension":"jar"}

By default, records whose classifier is anything other than NA are filtered out. Use --full to include all records.
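
The default filter can be pictured with a small sketch. This is not sluice's implementation: Coordinates, parse_uinfo, and is_root_level are hypothetical names, and the pipe-separated layout (groupId|artifactId|version|classifier|extension, with NA standing in for a missing classifier) is an assumption based on the Maven Indexer convention; docs/binary-format.md is the authoritative description.

```rust
/// Hypothetical stand-in for a parsed UINFO tuple (not sluice's `Uinfo`).
#[derive(Debug, PartialEq)]
struct Coordinates {
    group_id: String,
    artifact_id: String,
    version: String,
    classifier: Option<String>,
    extension: Option<String>,
}

/// Parse a UINFO string, assuming `|` as the separator and `NA` as the
/// "no classifier" marker. Returns None if fewer than three fields exist.
fn parse_uinfo(raw: &str) -> Option<Coordinates> {
    let mut parts = raw.split('|');
    let group_id = parts.next()?.to_string();
    let artifact_id = parts.next()?.to_string();
    let version = parts.next()?.to_string();
    // Map the `NA` placeholder to None so callers never see the sentinel.
    let classifier = parts.next().filter(|c| *c != "NA").map(str::to_string);
    let extension = parts.next().map(str::to_string);
    Some(Coordinates {
        group_id,
        artifact_id,
        version,
        classifier,
        extension,
    })
}

/// Mirrors the default behaviour described above: keep only records
/// without a classifier. With --full, this check would be skipped.
fn is_root_level(coords: &Coordinates) -> bool {
    coords.classifier.is_none()
}
```

Under these assumptions, org.example|lib|1.0|NA|jar passes the filter, and org.example|lib|1.0|sources|jar is dropped unless --full is given.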

Library usage

The core library reads from any std::io::Read. For gzipped index files, bring your own decompressor — flate2 works. The crate is published as sluice-rs on crates.io; the import path is sluice:

[dependencies]
sluice-rs = "0.1"
flate2 = "1"

use std::fs::File;
use std::io::BufReader;
use flate2::read::GzDecoder;
use sluice::{IndexReader, Record};

// Wrapped in `main` so that `?` compiles; this assumes sluice's error types
// implement `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("fixtures/chunk-latest.gz")?;
    // Decompress first: the library itself knows nothing about gzip.
    let gz = GzDecoder::new(BufReader::new(file));
    let index = IndexReader::new(BufReader::new(gz))?;

    for doc in index {
        let doc = doc?;
        // `Uinfo` implements `Display` as `groupId:artifactId:version[:classifier][:extension]`.
        match Record::try_from(&doc)? {
            Record::ArtifactAdd(u) => println!("add {u}"),
            Record::ArtifactRemove(u) => println!("del {u}"),
            // `Record` is `#[non_exhaustive]`; match `_` for descriptors, group lists,
            // and any future variants.
            _ => {}
        }
    }
    Ok(())
}
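
The Display format mentioned in the comment above can be illustrated with a stand-in type. Coords is hypothetical and is not sluice's Uinfo; the sketch only shows one way a groupId:artifactId:version[:classifier][:extension] rendering could be written, with the optional parts appended when present.

```rust
use std::fmt;

/// Hypothetical stand-in for illustration only (not sluice's `Uinfo`).
struct Coords<'a> {
    group_id: &'a str,
    artifact_id: &'a str,
    version: &'a str,
    classifier: Option<&'a str>,
    extension: Option<&'a str>,
}

impl fmt::Display for Coords<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The three mandatory parts always appear.
        write!(f, "{}:{}:{}", self.group_id, self.artifact_id, self.version)?;
        // Optional parts are appended only when present.
        if let Some(classifier) = self.classifier {
            write!(f, ":{classifier}")?;
        }
        if let Some(extension) = self.extension {
            write!(f, ":{extension}")?;
        }
        Ok(())
    }
}
```

With this sketch, a sources jar for org.example:lib:1.0 renders as org.example:lib:1.0:sources:jar.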

Serde support

Enable the serde feature to derive Serialize on all domain types (Record, Uinfo, Document, etc.):

[dependencies]
sluice-rs = { version = "0.1", features = ["serde"] }
serde_json = "1"

use sluice::{IndexReader, Record};

// ... set up IndexReader as above ...

for doc in index {
    let doc = doc?;
    if let Record::ArtifactAdd(ref uinfo) = Record::try_from(&doc)? {
        println!("{}", serde_json::to_string(uinfo)?);
    }
}

Performance

On the full Maven Central index (2.8 GB compressed, ~97M documents), sluice takes about 208 seconds end-to-end. The Java indexer-reader from Apache Maven Indexer takes about 1112 seconds on the same input.

Tool                    Mean     Relative
sluice (Rust)           208s     1.00
indexer-reader (Java)   1112s    5.35

These numbers aren't directly comparable: the Java tool does additional per-record work (field expansion via RecordExpander) that sluice doesn't, so some of the gap is workload, not implementation. Output matches across all ~97M records. Methodology and reproduction steps are in docs/benchmark.md.
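
As a worked check of the figures above, the Relative column is each mean divided by the sluice mean, and the timings imply a rough throughput. The helper names below are hypothetical, and the record count is the approximate ~97M quoted above, so the throughput is an estimate only.

```rust
/// Ratio of a tool's mean runtime to the baseline mean (hypothetical helper).
fn relative(mean_secs: f64, baseline_secs: f64) -> f64 {
    mean_secs / baseline_secs
}

/// Approximate throughput in documents per second (hypothetical helper).
fn records_per_sec(records: f64, secs: f64) -> f64 {
    records / secs
}
```

Plugging in the table values, relative(1112.0, 208.0) is about 5.35, and records_per_sec(97_000_000.0, 208.0) comes to roughly 466,000 documents per second for sluice.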

Development

Recipes are run through just (cargo install just or brew install just):

just fmt         # cargo fmt --all
just fmt-check   # cargo fmt --all -- --check
just lint        # cargo clippy --all-targets --all-features -- -D warnings
just test        # cargo test --all
just fetch-chunk # download the latest incremental chunk into fixtures/
just run-chunk   # parse fixtures/chunk-latest.gz with --stats
just fetch-full  # download the full Maven Central index (~2.8 GB)
just run-full    # parse the full index

The Rust toolchain is pinned via rust-toolchain.toml. MSRV is 1.75 for the library (sluice-rs) and 1.85 for the CLI (sluice-cli), because clap transitively requires the 2024 edition, which was stabilized in Rust 1.85. Lints are workspace-wide: rust_2018_idioms is denied and clippy::pedantic is set to warn.

Test fixtures

A small sample fixture (crates/core/tests/fixtures/chunk-sample.gz) is committed for offline testing. To regenerate it from a full Maven Central chunk:

just fetch-chunk
just regen-fixture

The full fixture is not committed to keep clone sizes small.

License

Apache-2.0.
