Skip to content

High-performance S3 prefetching FUSE filesystem for ML/AI training workloads.

License

Notifications You must be signed in to change notification settings

misran3/valkyrie-fs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Valkyrie-FS

High-performance S3 prefetching FUSE filesystem for ML/AI training workloads.

Problem

AI training workloads read thousands of sequential shards from S3. Without prefetching, GPUs sit idle waiting for I/O, achieving <50% utilization.

Solution

Valkyrie-FS intelligently prefetches upcoming shards while the GPU processes current data, achieving 95%+ GPU utilization.

Performance

Benchmarked on macOS with 10MB test files to S3 (us-east-1):

Metric Direct S3 Cold Cache Warm Cache Improvement
Throughput 1.10 MB/s 1.55 MB/s 15.06 MB/s 13.6x
Time to First Byte N/A 1.67s 10ms 99.4%
Cache Hit Rate 0% 0% 100% -

Key Results:

  • 13.6x faster than direct S3 downloads on cached reads
  • 99.4% reduction in time to first byte (warm cache)
  • 100% data integrity verified via MD5 checksums

Real-world performance depends on network speed, file size, and access patterns. Sequential workloads see the greatest benefit from prefetching.

Features

  • Chunk-based caching: 4MB chunks for instant response on large files
  • Two-tier cache: Hot (LRU) + Prefetch (FIFO) zones prevent cache pollution
  • Intelligent prediction: Sequential pattern detection + manifest support
  • Production-grade: Prometheus metrics, structured logging, trace files

Build

Dependencies

Ubuntu/Debian:

sudo apt install libfuse3-dev cmake g++ libssl-dev libcurl4-openssl-dev

macOS:

brew install macfuse cmake

Note: macFUSE may require system reboot after installation.

Install AWS SDK for C++

Ubuntu/Debian:

sudo apt install libaws-cpp-sdk-s3-dev

macOS:

brew install aws-sdk-cpp

From source (if package not available):

git clone --depth 1 --branch 1.11.200 https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3" -DENABLE_TESTING=OFF
make -j$(nproc)
sudo make install

Compile

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

# or
make -j$(sysctl -n hw.logicalcpu)

Run

sudo ./build/bin/valkyrie \
  --mount /mnt/valkyrie \
  --bucket my-training-data \
  --cache-size 16G \
  --workers 8

Testing

Unit Tests

cd build
make test_types && ./bin/test_types
make test_queue && ./bin/test_queue
make test_cache_manager && ./bin/test_cache_manager
make test_s3_mock && ./bin/test_s3_mock

S3 Integration Test

Requires AWS credentials and a test bucket:

export TEST_BUCKET=your-test-bucket
export TEST_REGION=us-east-1
./scripts/test_s3_integration.sh

Architecture

See docs/plans/2026-01-18-valkyrie-fs-design.md

Usage

Mounting

Mount an S3 bucket as a local filesystem:

sudo ./build/bin/valkyrie \
  --mount /mnt/valkyrie \
  --bucket my-training-data \
  --cache-size 16G \
  --workers 8 \
  --lookahead 32 \
  --region us-east-1

The mount point must exist and be empty:

sudo mkdir -p /mnt/valkyrie

Accessing Files

Once mounted, access files like any local directory:

# List files
ls -lh /mnt/valkyrie/shards/

# Read a file
cat /mnt/valkyrie/data/shard_0001.tar

# Stream to training script
python train.py --data-path /mnt/valkyrie/shards/

Valkyrie-FS detects sequential access patterns and prefetches upcoming files automatically.

Using a Manifest

For best performance, provide a manifest file listing files in training order:

# Create manifest with training file order
cat > /tmp/training_manifest.txt <<EOF
shards/shard_0001.tar
shards/shard_0002.tar
shards/shard_0003.tar
EOF

# Mount with manifest
sudo ./build/bin/valkyrie \
  --mount /mnt/valkyrie \
  --bucket my-training-data \
  --manifest /tmp/training_manifest.txt \
  --cache-size 16G \
  --workers 8

With a manifest, prefetching starts immediately without waiting for pattern detection.

Unmounting

Unmount when finished:

sudo umount /mnt/valkyrie

On Linux, you can also use:

fusermount -u /mnt/valkyrie

Performance Tips

Cache Size

Set cache size based on your shard size and prefetch needs:

# Small shards (< 100MB): 4-8GB cache
--cache-size 4G

# Medium shards (100-500MB): 8-16GB cache
--cache-size 16G

# Large shards (> 500MB): 32GB+ cache
--cache-size 32G

Formula: cache_size = shard_size * (lookahead + 2)

The "+2" accounts for the current file being read plus one buffer.

Worker Threads

Configure workers based on your CPU and network:

# Network-bound (1 Gbps): 4-8 workers
--workers 4

# Balanced (10 Gbps): 8-16 workers
--workers 8

# CPU-bound or very fast network: 16-32 workers
--workers 16

More workers help when S3 latency is high or you need aggressive prefetching.

Lookahead Distance

Set lookahead based on your read speed vs network speed:

# Fast local NVMe, slow network: prefetch more
--lookahead 64

# Balanced: default works well
--lookahead 32

# Very fast network, slower processing: prefetch less
--lookahead 16

Monitor cache hit rate in metrics. Increase lookahead if you see cache misses.

Manifest Files

Always use a manifest for training workloads:

  1. Eliminates cold start: Prefetching starts immediately
  2. Perfect prediction: No pattern detection needed
  3. Optimal scheduling: Workers load files in exact order

Generate manifest from your dataloader:

# PyTorch example
with open('manifest.txt', 'w') as f:
    for shard_path in dataset.get_shard_paths():
        f.write(f"{shard_path}\n")

Troubleshooting

Mount Failures

Error: "cannot mount: /mnt/valkyrie not found"

  • Create the mount point: sudo mkdir -p /mnt/valkyrie

Error: "Transport endpoint is not connected"

  • Previous mount still active. Unmount first: sudo umount /mnt/valkyrie
  • On macOS, you may need to force unmount: sudo umount -f /mnt/valkyrie

Error: "Permission denied"

  • FUSE filesystem requires root access. Use sudo
  • On macOS, check that macFUSE kernel extension is loaded: kextstat | grep fuse

Error: "AWS credentials not found"

  • Set AWS credentials via environment variables:
    export AWS_ACCESS_KEY_ID=your_key
    export AWS_SECRET_ACCESS_KEY=your_secret
  • Or configure via aws configure

Slow Reads

First read is slow, then fast

  • Normal behavior. First read triggers prefetch, subsequent reads hit cache
  • Use --manifest to start prefetching before first read

All reads are slow

  • Check cache size: --cache-size may be too small for your shards
  • Check workers: Increase --workers if CPU allows
  • Check S3 region: Use --region closest to your location
  • Verify sequential access: Random access defeats prefetching

Cache thrashing

  • Reduce --lookahead if working set exceeds cache size
  • Increase --cache-size if possible

High Cache Miss Rate

Check Prometheus metrics at http://localhost:9090/metrics:

curl http://localhost:9090/metrics | grep cache_hit_rate

Cache hit rate < 80%

  • Increase --cache-size
  • Increase --lookahead to prefetch earlier
  • Verify access is sequential (not random)

Prefetch queue empty

  • Pattern not detected yet (first few files)
  • Use --manifest for immediate prefetching
  • Check logs for pattern detection messages

High memory usage

  • Reduce --cache-size
  • Reduce --lookahead
  • Check for memory leaks in logs

Debugging

Run with basic mount (diagnostic info printed to stdout/stderr):

sudo ./build/bin/valkyrie --mount /mnt/valkyrie --bucket my-data --region us-east-1

Check trace files for detailed operation logs:

ls -lh /tmp/valkyrie_trace_*.json

View Prometheus metrics:

curl http://localhost:9090/metrics

Status

  • Phase 1: Build system ✅
  • Phase 2: Core data structures ✅
  • Phase 3: S3 Worker Pool ✅
  • Phase 4: Prefetch Engine ✅
  • Phase 5: FUSE Filesystem ✅
  • Phase 6: Metrics & Observability ✅
  • Phase 7: Command-line Interface ✅
  • Phase 8: Integration & Testing ✅

License

MIT License. See LICENSE file for details.

About

High-performance S3 prefetching FUSE filesystem for ML/AI training workloads.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published