nssalian/parx

PARX

Early Development: This project is in active development. Format and APIs may change.

Persistent metadata caching for Parquet files.

What It Does

PARX caches Parquet metadata in sidecar files (.parx) to eliminate repeated metadata fetches.

The problem: Parquet stores its metadata in a footer at the end of the file. On object storage, reading it takes 3 requests: HEAD for the file size, GET_RANGE for the footer length, and GET_RANGE for the footer itself. When 10 workers read the same file, that is 30 requests per file.

The solution: Cache the metadata once in a .parx sidecar. Each worker then fetches the sidecar (1 request) instead of reading the footer (3 requests), so the same 10 workers issue 10 requests per file.

file.parquet (2.7 MB)
file.parquet.parx (282 KB)
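
To make the three-step footer read concrete, here is a minimal sketch that performs the same steps against an in-memory byte slice. It relies only on the standard Parquet tail layout (a 4-byte little-endian footer length followed by the magic "PAR1"). The function name `footer_range` is hypothetical and is not part of PARX.

```rust
use std::ops::Range;

/// Parquet files end with a 4-byte little-endian footer length and the magic "PAR1".
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";
const TAIL_LEN: usize = 8;

/// Hypothetical helper: returns the byte range of the footer inside `file`.
fn footer_range(file: &[u8]) -> Result<Range<usize>, String> {
    // Step 1 (HEAD on object storage): learn the file size.
    let size = file.len();
    // A valid file holds at least the leading magic (4 bytes) and the 8-byte tail.
    if size < 4 + TAIL_LEN {
        return Err("file too small to be Parquet".to_string());
    }

    // Step 2 (first GET_RANGE): read the last 8 bytes to get the footer length.
    let tail = &file[size - TAIL_LEN..];
    if tail[4..] != PARQUET_MAGIC[..] {
        return Err("missing PAR1 magic at end of file".to_string());
    }
    let footer_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;

    // Step 3 (second GET_RANGE): the footer sits immediately before the tail.
    let end = size - TAIL_LEN;
    let start = end
        .checked_sub(footer_len)
        .ok_or_else(|| "footer length exceeds file size".to_string())?;
    Ok(start..end)
}
```

On object storage each numbered step is a separate round trip, which is the cost a single sidecar fetch is meant to replace.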

Format

Single-file format:

┌──────────────────────────────────────────┐
│ Header (16 bytes)                        │
│  - Magic: "PARX"                         │
│  - Version, Flags                        │
├──────────────────────────────────────────┤
│ Footer Payload (variable, raw/compressed)│
│  - Raw Parquet footer bytes              │
├──────────────────────────────────────────┤
│ Page Index Payload (optional)            │
│  - ColumnIndex + OffsetIndex             │
├──────────────────────────────────────────┤
│ Manifest (Protobuf)                      │
│  - Offsets, lengths, checksums           │
│  - Source file size                      │
├──────────────────────────────────────────┤
│ Trailer (12 bytes)                       │
│  - Manifest length, CRC32C               │
│  - Magic: "PARX"                         │
└──────────────────────────────────────────┘
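
As an illustration of how a reader might locate the manifest from the trailer, the sketch below parses the last 12 bytes of a sidecar. The byte-level details are assumptions, not taken from FORMAT_SPEC.md: it assumes the trailer fields appear in the order shown in the diagram, that the length and CRC32C are little-endian u32 values, and that the header magic sits at offset 0. All names are hypothetical, and the checksum is read but not verified.

```rust
use std::ops::Range;

const PARX_MAGIC: [u8; 4] = *b"PARX";
const HEADER_LEN: usize = 16; // header size from the diagram
const TRAILER_LEN: usize = 12; // trailer size from the diagram

/// Hypothetical view of the trailer; field order and widths are assumed.
#[derive(Debug, PartialEq)]
struct Trailer {
    manifest_len: u32,
    crc32c: u32,
}

fn read_u32_le(bytes: &[u8]) -> u32 {
    u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

/// Checks both magics and extracts the trailer fields (assumed little-endian).
fn parse_trailer(sidecar: &[u8]) -> Result<Trailer, String> {
    if sidecar.len() < HEADER_LEN + TRAILER_LEN {
        return Err("sidecar shorter than header + trailer".to_string());
    }
    if sidecar[..4] != PARX_MAGIC[..] {
        return Err("missing PARX magic in header".to_string());
    }
    let trailer = &sidecar[sidecar.len() - TRAILER_LEN..];
    if trailer[8..] != PARX_MAGIC[..] {
        return Err("missing PARX magic in trailer".to_string());
    }
    Ok(Trailer {
        manifest_len: read_u32_le(&trailer[0..4]),
        crc32c: read_u32_le(&trailer[4..8]),
    })
}

/// The manifest is assumed to sit directly before the trailer.
fn manifest_range(sidecar_len: usize, trailer: &Trailer) -> Option<Range<usize>> {
    let end = sidecar_len.checked_sub(TRAILER_LEN)?;
    let start = end.checked_sub(trailer.manifest_len as usize)?;
    if start < HEADER_LEN {
        return None; // manifest would overlap the header
    }
    Some(start..end)
}
```

Reading from the end mirrors how Parquet itself is read: the trailer gives the manifest length, and the manifest then gives the offsets of the payloads.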

Bundle format (for directories):

┌──────────────────────────────────────────┐
│ Bundle Header (24 bytes)                 │
│  - Magic: "PRXB"                         │
│  - Version, Flags                        │
│  - Entry count                           │
├──────────────────────────────────────────┤
│ Entry 0: Footer (+ optional page indexes)│
├──────────────────────────────────────────┤
│ Entry 1: Footer (+ optional page indexes)│
├──────────────────────────────────────────┤
│ ... (N entries)                          │
├──────────────────────────────────────────┤
│ Bundle Manifest (Protobuf)               │
│  - Path→Entry mapping                    │
├──────────────────────────────────────────┤
│ Trailer (12 bytes)                       │
│  - Manifest length, CRC32C               │
│  - Magic: "PRXB"                         │
└──────────────────────────────────────────┘

See FORMAT_SPEC.md for the detailed byte-level layout. Bundle entries can optionally include page-index payloads, subject to policy-controlled size caps.
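
The Path→Entry mapping in the bundle manifest can be pictured as a lookup table from file path to a byte range inside the bundle. The sketch below is illustrative only: `EntryLocation` and `entry_bytes` are hypothetical names, and the real manifest is a Protobuf message whose fields may differ.

```rust
use std::collections::HashMap;

/// Hypothetical location of one entry inside the bundle bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct EntryLocation {
    offset: usize,
    length: usize,
}

/// Returns the bytes of the entry registered for `path`, if any.
fn entry_bytes<'a>(
    bundle: &'a [u8],
    index: &HashMap<String, EntryLocation>,
    path: &str,
) -> Option<&'a [u8]> {
    let loc = index.get(path)?;
    // Guard against overflow and out-of-bounds ranges from a corrupt manifest.
    let end = loc.offset.checked_add(loc.length)?;
    bundle.get(loc.offset..end)
}
```

With a mapping like this, one bundle fetch can serve footer lookups for every file in the directory.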

Building

# Core library
cd implementations/rust/parx
cargo build --release

# CLI tool (install to ~/.cargo/bin)
cd implementations/rust/parx-cli
cargo install --path . --locked

# Benchmarks
cd benchmarks/parx_benchmarks
make all

If parx is not found after install, ensure ~/.cargo/bin is on your PATH.

CLI Usage

# Build sidecar for single file
parx build file.parquet

# Verify sidecar
parx verify file.parquet.parx

# Inspect contents
parx inspect file.parquet.parx

# Bundle directory
parx bundle build /data/events/
# Creates: /data/events/_parx_bundle.parx

# Bundle directory with page indexes (optional, capped)
parx bundle build /data/events/ \
  --include-page-indexes \
  --max-page-index-bytes-per-file 262144 \
  --max-total-page-index-bytes 16777216

# Extract bundle
parx bundle extract /data/events/_parx_bundle.parx --output /output/

For a full local CLI validation walkthrough (install, generated fixtures, and end-to-end commands), see docs/CLI_SMOKE_TEST.md.
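
One way to read the two page-index caps passed to `parx bundle build` is as a per-file limit plus a running total. The sketch below shows that interpretation only. It assumes files are considered in input order and that an oversized page index is dropped rather than truncated; the actual policy in PARX may differ. `PageIndexPolicy` and `select_page_indexes` are hypothetical names.

```rust
/// Hypothetical policy mirroring the two CLI caps.
struct PageIndexPolicy {
    max_bytes_per_file: u64,
    max_total_bytes: u64,
}

/// For each file's page-index size, decides whether it is included in the bundle.
/// Assumption: greedy selection in input order; oversized indexes are skipped.
fn select_page_indexes(sizes: &[u64], policy: &PageIndexPolicy) -> Vec<bool> {
    let mut total = 0u64;
    sizes
        .iter()
        .map(|&size| {
            let fits = size <= policy.max_bytes_per_file
                && total.saturating_add(size) <= policy.max_total_bytes;
            if fits {
                total += size;
            }
            fits
        })
        .collect()
}
```

Under this reading, the values in the command above would allow up to 256 KiB of page indexes per file and 16 MiB across the whole bundle.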

Library Usage

use parx_rs::{ParxReader, ParxWriter};

// Assumes the parx_rs error types implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write: build the sidecar directly from a Parquet file
    let mut writer = ParxWriter::from_parquet_file("file.parquet")?;
    let parx_bytes = writer.finish();
    std::fs::write("file.parquet.parx", parx_bytes)?;

    // Read: load the cached footer from the .parx sidecar
    let parx_data = std::fs::read("file.parquet.parx")?;
    let reader = ParxReader::open(&parx_data)?;
    let footer = reader.footer_bytes(); // Raw Parquet footer, ready to use
    Ok(())
}
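
Because the manifest records the source file size, a caller could use it as a cheap staleness check before trusting a sidecar. The helpers below are a sketch using only the standard library. Both function names are hypothetical, and how the recorded size is obtained from the manifest is left out because that accessor is not shown here.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Derives the sidecar path by appending ".parx" (file.parquet -> file.parquet.parx).
fn sidecar_path(parquet: &Path) -> PathBuf {
    let mut name = parquet.as_os_str().to_owned();
    name.push(".parx");
    PathBuf::from(name)
}

/// Returns true when the Parquet file on disk still has the size the sidecar recorded.
/// `recorded_size` stands in for the manifest's source file size (assumed accessor).
fn sidecar_matches_source(recorded_size: u64, parquet: &Path) -> io::Result<bool> {
    let actual = std::fs::metadata(parquet)?.len();
    Ok(actual == recorded_size)
}
```

A size match is only a heuristic: a file rewritten to the same length would pass, which is one reason immutable or versioned files suit this approach best.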

Benchmarks

Local tests with 4 schema types (simple, medium, wide, nested):

Arrow async vs PARX:

  • Requests: 3.0 → 1.0 per file (66.7% reduction)
  • Latency: ~428 µs → ~208 µs (~2x faster)
  • Bytes: ~25 KB → ~26 KB (~2% overhead)

Note: this benchmark measures the metadata read path with prebuilt .parx sidecars; one-time sidecar creation is excluded.
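
For reference, the request reduction and latency speedup quoted above follow from simple ratios. The helper names below are hypothetical and exist only to show the arithmetic.

```rust
/// Percentage reduction going from `before` to `after`.
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Speedup factor: how many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

For example, `reduction_pct(3.0, 1.0)` gives roughly 66.7 and `speedup(428.0, 208.0)` gives roughly 2.06.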

Run benchmarks:

cd benchmarks/parx_benchmarks
make arrow-vs-parx  # Arrow vs PARX comparison
make prefetch       # Prefetch hint testing

When to Use

Use it:

  • Cloud storage (S3/GCS/Azure)
  • Multiple processes reading same files
  • Immutable or versioned files
  • Parquet V2 with page indexes (including bundle mode with policy caps)

Skip it:

  • Single-process work (in-memory cache is fine)
  • Local SSD (minimal benefit)
  • Delta/Iceberg/Hudi tables (built-in metadata)
  • Frequently updated files

Project Structure

parx/
├── implementations/rust/
│   ├── parx/              # Core library
│   └── parx-cli/          # CLI tool
├── benchmarks/
│   └── parx_benchmarks/   # Performance tests
├── spec/
│   └── proto/             # Protobuf schema
└── FORMAT_SPEC.md         # Format specification

Testing

# Unit tests
cargo test

# Integration tests
cd implementations/rust/parx-cli
cargo test --test cli_integration

License

Apache 2.0
