A high-performance, enterprise-grade tool for identifying differences between S3-compatible buckets. Built by Nutanix for production workloads requiring reliable bucket synchronization verification, migration validation, and compliance auditing.
The Bucket Delta Diff Checker is a Python-based utility designed to efficiently compare two S3-compatible buckets and identify discrepancies. Whether you're validating data migrations, ensuring backup integrity, or maintaining compliance across multi-cloud environments, this tool provides comprehensive bucket analysis with enterprise-grade performance and reliability.
- Comprehensive Comparison: Detect missing objects and metadata differences
- High Performance: Multi-process architecture with configurable concurrency for optimal throughput
- Enterprise Ready: Built for production workloads with robust error handling and detailed logging
- Detailed Reporting: Comprehensive difference reports with customizable output formats
- Resumability: Checkpoint-based recovery allows resuming interrupted comparisons
```mermaid
graph TB
    subgraph "Input Sources"
        SRC["Source Bucket<br/>S3 Compatible"]
        DST["Destination Bucket<br/>S3 Compatible"]
    end
    subgraph "Bucket Delta Diff Checker"
        subgraph "Main Process"
            LST1["List Objects API<br/>Source"]
            LST2["List Objects API<br/>Destination"]
            MERGE["Merge Sort Algorithm<br/>O(n+m) Complexity"]
            SHALLOW["Shallow Check<br/>Basic Metadata"]
        end
        subgraph "Deep Check Pipeline"
            WQ["Work Queue<br/>Configurable Size"]
            WP1[Worker Process 1]
            WP2[Worker Process 2]
            WPN[Worker Process N]
            DEEP["Deep Check<br/>Tags, WORM, Custom Metadata"]
        end
        subgraph "Reporting"
            RQ[Report Queue]
            RP[Report Process]
            CP["Checkpoint File<br/>/tmp/diff_checker_*.checkpoint"]
        end
    end
    subgraph "Output"
        RPT["Difference Report<br/>Structured Output"]
        LOG["Execution Logs<br/>Configurable Levels"]
        STATS["Performance Statistics<br/>Throughput Metrics"]
    end
    SRC --> LST1
    DST --> LST2
    LST1 --> MERGE
    LST2 --> MERGE
    MERGE --> SHALLOW
    SHALLOW --> RQ
    SHALLOW --> WQ
    WQ --> WP1
    WQ --> WP2
    WQ --> WPN
    WP1 --> DEEP
    WP2 --> DEEP
    WPN --> DEEP
    DEEP --> RQ
    RQ --> RP
    RP --> RPT
    RP --> CP
    MERGE --> LOG
    MERGE --> STATS
    style SRC fill:#e1f5fe
    style DST fill:#e1f5fe
    style CP fill:#ffe0b2
    style MERGE fill:#fff3e0
    style SHALLOW fill:#f3e5f5
    style DEEP fill:#e8f5e8
    style RPT fill:#fff8e1
```
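Because `ListObjectsV2` returns keys in sorted order from both buckets, the merge step in the diagram above can find all deltas in a single linear pass. A minimal illustrative sketch of the idea (not the tool's actual code):

```python
# Illustrative O(n+m) merge-diff over two sorted object listings.
# Each listing is a list of (key, etag, size) tuples sorted by key.
def merge_diff(src, dst):
    diffs = []
    i = j = 0
    while i < len(src) and j < len(dst):
        s, d = src[i], dst[j]
        if s[0] < d[0]:                      # key exists only in source
            diffs.append(("only_in_source", s[0]))
            i += 1
        elif s[0] > d[0]:                    # key exists only in destination
            diffs.append(("only_in_destination", d[0]))
            j += 1
        else:                                # same key: compare basic metadata
            if s[1:] != d[1:]:
                diffs.append(("metadata_differs", s[0]))
            i += 1
            j += 1
    # whichever listing is longer contributes its remaining keys
    diffs += [("only_in_source", k) for k, *_ in src[i:]]
    diffs += [("only_in_destination", k) for k, *_ in dst[j:]]
    return diffs
```

Each key is visited at most once, which is why the shallow check scales linearly with the combined size of the two listings.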
- Purpose-Built: Specifically designed for bucket comparison, not general file transfer
- Deep Analysis: Goes beyond basic file comparison to examine S3-specific metadata
- Scalable: Handles enterprise-scale buckets with millions of objects efficiently
- Configurable: Extensive tuning options for different hardware and network conditions
- Resumable: Checkpoint-based resumability with automatic progress saving to `/tmp/diff_checker_{src_bucket}_{dst_bucket}.checkpoint`, ideal for long-running comparisons with millions of objects.
- Fast Object Enumeration: Quickly identifies missing objects between buckets
- Basic Metadata Comparison: Compares ETag and size
- Optimal for: Initial migration validation, quick consistency checks
- S3 APIs used: `ListObjectsV2` (on both source and destination buckets)
- Comprehensive Metadata Analysis: Examines object tags, custom metadata, and WORM settings
- Advanced Comparison: Includes ObjectLock configuration and retention mode
- Optimal for: Compliance auditing, detailed migration verification
- S3 APIs used: `ListObjectsV2` (on both buckets) + `HeadObject` and `GetObjectTagging` (per object, on both buckets)
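To illustrate the per-object deep-check calls listed above, here is a minimal sketch for a boto3 S3 client; the function name `fetch_deep_metadata` and the returned fields are illustrative assumptions, not the tool's actual code:

```python
# Illustrative sketch of the deep-check API calls per object.
# client: a boto3 S3 client, e.g. boto3.client('s3', endpoint_url=...).
def fetch_deep_metadata(client, bucket, key):
    head = client.head_object(Bucket=bucket, Key=key)
    tags = client.get_object_tagging(Bucket=bucket, Key=key)
    return {
        "metadata": head.get("Metadata", {}),               # custom x-amz-meta-* headers
        "lock_mode": head.get("ObjectLockMode"),            # WORM retention mode
        "retain_until": head.get("ObjectLockRetainUntilDate"),
        "tags": {t["Key"]: t["Value"] for t in tags["TagSet"]},
    }
```

Comparing the dictionaries returned for the source and destination copies of an object then yields the tag, WORM, and custom-metadata differences the deep check reports.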
```mermaid
graph LR
    subgraph "Producer"
        MAIN["Main Process<br/>List & Compare"]
    end
    subgraph "Consumer Pool"
        W1["Worker 1<br/>Deep Check"]
        W2["Worker 2<br/>Deep Check"]
        WN["Worker N<br/>Deep Check"]
    end
    subgraph "Reporting"
        RQ[Report Queue]
        REP["Report Process<br/>Output Generation"]
    end
    MAIN -->|Work Queue| W1
    MAIN -->|Work Queue| W2
    MAIN -->|Work Queue| WN
    W1 --> RQ
    W2 --> RQ
    WN --> RQ
    RQ --> REP
    style MAIN fill:#e3f2fd
    style W1 fill:#f1f8e9
    style W2 fill:#f1f8e9
    style WN fill:#f1f8e9
    style RQ fill:#fff3e0
    style REP fill:#fff3e0
```
- Configurable Worker Processes: Scale processing based on available CPU cores
- Connection Pool Management: Optimize network utilization with configurable connection limits
- Memory Management: Queue size limits prevent memory exhaustion on large datasets
- Progress Reporting: Real-time throughput metrics and completion estimates
- Structured Output: Machine-readable difference reports
- Performance Metrics: Execution time, throughput, and processing statistics
- Detailed Logging: Configurable log levels for debugging and monitoring
- Secure Credential Handling: Support for access key and secret key authentication
- Audit Trail: Detailed logs suitable for compliance reporting
- Error Handling: Robust error recovery and reporting mechanisms
- Python 3.7+ with pip
- Network Access to both source and destination S3 endpoints
- Credentials with read access to both buckets
To use the bucket_delta tool, follow these steps:

1. Download and install pip on the Linux VM:

   ```bash
   curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" && sudo python3 get-pip.py
   ```

2. Install the prerequisites:

   ```bash
   sudo pip3 install -r requirements.txt
   ```

3. Run the tool.

   Option A: Run directly from source:

   ```bash
   python3 bucket_delta.py --help
   ```

   Option B: Build a standalone executable with PyInstaller:

   ```bash
   # Install PyInstaller
   pip3 install pyinstaller
   # Build a single-file executable
   python3 -m PyInstaller --onefile bucket_delta.py --name bucket_delta --hidden-import=ConfigParser
   ```
```bash
# Simple bucket comparison
python3 bucket_delta.py \
  --source_bucket_name my-source-bucket \
  --source_endpoint_url https://s3.amazonaws.com \
  --source_access_key AKIA... \
  --source_secret_key secret... \
  --destination_bucket_name my-dest-bucket \
  --destination_endpoint_url https://s3.amazonaws.com \
  --destination_access_key AKIA... \
  --destination_secret_key secret... \
  --output_file differences.txt
```

```bash
# Deep check with performance tuning
python3 bucket_delta.py \
  --source_bucket_name production-data \
  --source_endpoint_url https://s3.us-west-2.amazonaws.com \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_bucket_name backup-data \
  --destination_endpoint_url https://backup.company.com \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  --output_file detailed_differences.txt \
  --shallow_check false \
  --compare_object_tags true \
  --num_processes 20 \
  --num_connections 10 \
  --log_level DEBUG
```

```bash
# If a previous run was interrupted, resume from checkpoint
python3 bucket_delta.py \
  --source_bucket_name production-data \
  --source_endpoint_url https://s3.us-west-2.amazonaws.com \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_bucket_name backup-data \
  --destination_endpoint_url https://backup.company.com \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  --output_file detailed_differences.txt \
  --resume
```

The tool automatically saves checkpoints every `--num_objects_report` objects (default: 1000) to `/tmp/diff_checker_{source_bucket}_{dest_bucket}.checkpoint`. On successful completion, the checkpoint file is automatically deleted.
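For wrapper scripts, a small illustrative helper (not part of the tool) can decide whether to pass `--resume` based on whether a checkpoint file from an interrupted run is present:

```python
# Illustrative helper for wrapper scripts: add --resume only when a
# checkpoint from an interrupted run exists. Not part of the tool itself.
import os

def checkpoint_path(src_bucket, dst_bucket):
    # Path format as documented above.
    return f"/tmp/diff_checker_{src_bucket}_{dst_bucket}.checkpoint"

def build_resume_flag(src_bucket, dst_bucket):
    # Returns extra CLI arguments to append to the bucket_delta command line.
    if os.path.exists(checkpoint_path(src_bucket, dst_bucket)):
        return ["--resume"]
    return []
```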
| Parameter | Description | Example |
|---|---|---|
| `--source_bucket_name` | Name of the source bucket | `my-source-bucket` |
| `--source_endpoint_url` | S3 endpoint URL for source | `https://s3.amazonaws.com` |
| `--source_access_key` | Access key for source bucket | `AKIAIOSFODNN7EXAMPLE` |
| `--source_secret_key` | Secret key for source bucket | `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` |
| `--destination_bucket_name` | Name of the destination bucket | `my-dest-bucket` |
| `--destination_endpoint_url` | S3 endpoint URL for destination | `https://s3.amazonaws.com` |
| `--destination_access_key` | Access key for destination bucket | `AKIAIOSFODNN7EXAMPLE` |
| `--destination_secret_key` | Secret key for destination bucket | `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` |
| `--output_file` | Path to output differences file | `./differences.txt` |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--shallow_check` | boolean | `true` | Enable shallow check mode (basic metadata only) |
| `--compare_object_tags` | boolean | `true` | Compare object tags during deep checks |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--resume` | flag | `false` | Resume from last checkpoint if available |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--num_processes` | integer | `10` | Number of worker processes for parallel processing |
| `--num_connections` | integer | `10` | Number of HTTP connections per S3 client |
| `--max_queue_size` | integer | `100` | Maximum items in work queue (memory management) |
| `--num_objects_report` | integer | `1000` | Progress reporting interval (objects processed) |
Note: Each worker process creates its own S3 client with `num_connections` connections. The total number of concurrent HTTP connections to each endpoint is `num_processes × num_connections` (e.g., 10 processes × 10 connections = 100 connections per endpoint).
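The sizing rule above can be captured in a tiny planning helper; `endpoint_limit` below is a made-up example value, not a documented limit of any particular object store:

```python
# Illustrative connection-planning helper. endpoint_limit is an assumed
# example value; check your object store's documentation for real limits.
def plan_connections(num_processes: int, num_connections: int,
                     endpoint_limit: int = 500) -> dict:
    total = num_processes * num_connections   # connections per endpoint
    return {"total": total, "within_limit": total <= endpoint_limit}
```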
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
| `--log_level` | string | `INFO` | `DEBUG`, `INFO`, `WARN`, `ERROR` | Logging verbosity level |
The tool generates a structured text file with the following format:

```
Only in source bucket my-source-bucket, Differences: ['Key: file1.txt', 'ETag: abc123', 'Size: 1024 bytes']
Only in destination bucket my-dest-bucket, Differences: ['Key: file2.txt', 'ETag: def456', 'Size: 2048 bytes']
Difference in Key: file3.txt, Differences: ['ETag differs', 'Size differs']
Difference in Key: file4.txt, Differences: ['Tags differ', 'ObjectLockMode differs']
```
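Since the report is line-oriented, it is straightforward to post-process, for example to feed differences into monitoring. A hedged parsing sketch based only on the sample lines above:

```python
# Illustrative parser for report lines; the format assumptions here are
# based solely on the sample output shown above.
import ast

def parse_report_line(line):
    prefix, _, diff_part = line.partition(", Differences: ")
    diffs = ast.literal_eval(diff_part)  # the list is printed in Python repr form
    if prefix.startswith("Only in source bucket "):
        kind = "only_in_source"
    elif prefix.startswith("Only in destination bucket "):
        kind = "only_in_destination"
    else:
        kind = "differs"
    return {"kind": kind, "detail": prefix, "differences": diffs}
```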
Sample console output during a run:

```
[2025-01-15 10:30:15 - INFO] - Validating input arguments
[2025-01-15 10:30:15 - INFO] - Input arguments validated successfully
[2025-01-15 10:30:15 - INFO] - Log level set to INFO
[2025-01-15 10:30:16 - INFO] - Listing all objects from source bucket: my-source-bucket
[2025-01-15 10:30:16 - INFO] - Listing all objects from destination bucket: my-dest-bucket
[2025-01-15 10:30:16 - INFO] - Starting bucket comparison
[2025-01-15 10:30:16 - INFO] - Mode: latest objects only
[2025-01-15 10:30:25 - INFO] - Processed 1000 objects in 8.45 seconds, total elapsed 8.45 seconds
[2025-01-15 10:30:35 - INFO] - Processed 2000 objects in 9.12 seconds, total elapsed 17.57 seconds
[2025-01-15 10:30:42 - INFO] - ------Final report------
[2025-01-15 10:30:42 - INFO] - Total objects processed: 2543
[2025-01-15 10:30:42 - INFO] - Total objects with differences: 15
[2025-01-15 10:30:42 - INFO] - Total execution time: 25.83 seconds
[2025-01-15 10:30:42 - INFO] - Throughput: 98.45 objects/second
[2025-01-15 10:30:42 - INFO] - ------End of report------
```
For enhanced security, credentials can be provided via environment variables:

```bash
export SRC_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export SRC_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export DST_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export DST_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

python3 bucket_delta.py \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  # ... other parameters
```

```bash
#!/bin/bash
# batch_compare.sh - Compare multiple bucket pairs
BUCKET_PAIRS=(
  "prod-data:backup-data"
  "user-uploads:user-backup"
  "analytics:analytics-backup"
)

for pair in "${BUCKET_PAIRS[@]}"; do
  IFS=':' read -r src dst <<< "$pair"
  echo "Comparing $src -> $dst"
  python3 bucket_delta.py \
    --source_bucket_name "$src" \
    --destination_bucket_name "$dst" \
    --source_endpoint_url https://s3.amazonaws.com \
    --destination_endpoint_url https://backup.company.com \
    --source_access_key $SRC_ACCESS_KEY \
    --source_secret_key $SRC_SECRET_KEY \
    --destination_access_key $DST_ACCESS_KEY \
    --destination_secret_key $DST_SECRET_KEY \
    --output_file "differences_${src}_${dst}.txt" \
    --log_level INFO
done
```

```yaml
# .github/workflows/bucket-validation.yml
name: Bucket Validation
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
jobs:
  validate-backups:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Validate Production Backup
        run: |
          python3 bucket_delta.py \
            --source_bucket_name production-data \
            --destination_bucket_name backup-data \
            --source_endpoint_url ${{ secrets.PROD_S3_ENDPOINT }} \
            --destination_endpoint_url ${{ secrets.BACKUP_S3_ENDPOINT }} \
            --source_access_key ${{ secrets.PROD_ACCESS_KEY }} \
            --source_secret_key ${{ secrets.PROD_SECRET_KEY }} \
            --destination_access_key ${{ secrets.BACKUP_ACCESS_KEY }} \
            --destination_secret_key ${{ secrets.BACKUP_SECRET_KEY }} \
            --output_file backup_validation.txt \
            --shallow_check false
      - name: Upload Results
        uses: actions/upload-artifact@v3
        with:
          name: validation-results
          path: backup_validation.txt
```

Symptom: Process consumes excessive memory
Solution: Reduce `--max_queue_size` and `--num_processes`, for example:

```bash
--max_queue_size 25 --num_processes 5
```

Symptom: Low throughput (< 10 objects/second)
Solutions:
- Increase concurrency: `--num_processes 20 --num_connections 50`
- Use shallow check for initial validation: `--shallow_check true`
- Reduce reporting frequency: `--num_objects_report 5000`
Symptom: Frequent network errors
Solutions:
- Reduce connection count: `--num_connections 10`
- Check network connectivity to S3 endpoints

Symptom: Access denied errors
Solutions:
- Verify bucket permissions: `s3:ListBucket`, `s3:GetObject`, `s3:GetObjectTagging`
- Check endpoint URLs are correct
- Validate access keys have appropriate permissions

Symptom: `--resume` flag doesn't continue from checkpoint
Solutions:
- Verify the checkpoint file exists: `ls /tmp/diff_checker_{source}_{dest}.checkpoint`
- Ensure bucket names match exactly
- Check checkpoint file contents: `cat /tmp/diff_checker_{source}_{dest}.checkpoint`
- For a fresh start, run without `--resume`
Enable detailed logging for troubleshooting:

```bash
./bucket_delta \
  --log_level DEBUG \
  # ... other parameters
```

This provides detailed information about:
- API calls and responses
- Object processing details
- Queue status and worker activity
- Performance metrics
Benchmarks below were run with 500K objects (45 KB each) and `--num_connections 10` on the following test environment:
| Resource | Specification |
|---|---|
| OS | Linux (Enterprise Linux 8) |
| CPU | 8 cores (x86_64) |
| RAM | 16 GB |
| Swap | 2 GB |
| Average latency | 3.971 ms (Host to Object Store endpoint) |
Shallow check runs as a single-process merge-diff; `--num_processes` does not affect its performance.
| Execution Time | Throughput | Peak CPU | Peak RAM |
|---|---|---|---|
| 153.18s | 3,917 obj/s | ~14.1% | ~1.31 GB |
| Processes | Execution Time | Throughput | Peak CPU | Peak RAM |
|---|---|---|---|---|
| 10 | 699.14s | 858 obj/s | ~53.4% | ~4.45 GB |
| 15 | 518.77s | 1,156.58 obj/s | ~76.5% | ~4.56 GB |
| 20 | 468.93s | 1,280 obj/s | ~91.7% | ~4.68 GB |
| 25 | 445.98s | 1,345 obj/s | ~84.2% | ~4.87 GB |
| 50 | 478.60s | 1,254 obj/s | ~96.2% | ~5.42 GB |
| 75 | 498.30s | 1,204 obj/s | ~93.8% | ~6.00 GB |
Peak throughput is at ~25 processes. Beyond that, throughput decreases while CPU and memory usage continue to rise due to API rate limiting and process overhead.
- Shallow check: Single-process; `--num_processes` has no effect. Performance is bounded by `ListObjectsV2` pagination and endpoint latency.
- Deep check: Start with 20–25 processes for the best throughput-to-resource ratio. Going higher increases memory and CPU without improving throughput.
- Memory-constrained environments: Reduce `--num_processes` and `--max_queue_size` (e.g., `--num_processes 5 --max_queue_size 25`).
- Connection planning: Total connections per endpoint = `num_processes × num_connections`. Keep this within endpoint rate limits; high values (e.g., 20 × 50 = 1000) may cause throttling.
- For very large buckets, run a shallow check first, then a deep check when detailed metadata validation is required.
This project is released under the Apache License 2.0.
- Engineering Team: s3-bucket-delta@nutanix.com
- GitHub Issues: Report bugs and request features