nutanix/bucket-delta-cli
Bucket Delta


A high-performance, enterprise-grade tool for identifying differences between S3-compatible buckets. Built by Nutanix for production workloads requiring reliable bucket synchronization verification, migration validation, and compliance auditing.

Introduction

The Bucket Delta Diff Checker is a Python-based utility designed to efficiently compare two S3-compatible buckets and identify discrepancies. Whether you're validating data migrations, ensuring backup integrity, or maintaining compliance across multi-cloud environments, this tool provides comprehensive bucket analysis with enterprise-grade performance and reliability.

Key Capabilities

  • Comprehensive Comparison: Detect missing objects and metadata differences
  • High Performance: Multi-process architecture with configurable concurrency for optimal throughput
  • Enterprise Ready: Built for production workloads with robust error handling and detailed logging
  • Detailed Reporting: Comprehensive difference reports with customizable output formats
  • Resumability: Checkpoint-based recovery allows resuming interrupted comparisons

Architecture Overview

```mermaid
graph TB
    subgraph "Input Sources"
        SRC["Source Bucket<br/>S3 Compatible"]
        DST["Destination Bucket<br/>S3 Compatible"]
    end

    subgraph "Bucket Delta Diff Checker"
        subgraph "Main Process"
            LST1["List Objects API<br/>Source"]
            LST2["List Objects API<br/>Destination"]
            MERGE["Merge Sort Algorithm<br/>O(n+m) Complexity"]
            SHALLOW["Shallow Check<br/>Basic Metadata"]
        end

        subgraph "Deep Check Pipeline"
            WQ["Work Queue<br/>Configurable Size"]
            WP1[Worker Process 1]
            WP2[Worker Process 2]
            WPN[Worker Process N]
            DEEP["Deep Check<br/>Tags, WORM, Custom Metadata"]
        end

        subgraph "Reporting"
            RQ[Report Queue]
            RP[Report Process]
            CP["Checkpoint File<br/>/tmp/diff_checker_*.checkpoint"]
        end
    end

    subgraph "Output"
        RPT["Difference Report<br/>Structured Output"]
        LOG["Execution Logs<br/>Configurable Levels"]
        STATS["Performance Statistics<br/>Throughput Metrics"]
    end

    SRC --> LST1
    DST --> LST2
    LST1 --> MERGE
    LST2 --> MERGE

    MERGE --> SHALLOW
    SHALLOW --> RQ
    SHALLOW --> WQ

    WQ --> WP1
    WQ --> WP2
    WQ --> WPN

    WP1 --> DEEP
    WP2 --> DEEP
    WPN --> DEEP

    DEEP --> RQ
    RQ --> RP
    RP --> RPT
    RP --> CP

    MERGE --> LOG
    MERGE --> STATS

    style SRC fill:#e1f5fe
    style CP fill:#ffe0b2
    style DST fill:#e1f5fe
    style MERGE fill:#fff3e0
    style SHALLOW fill:#f3e5f5
    style DEEP fill:#e8f5e8
    style RPT fill:#fff8e1
```
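The merge step in the diagram can be sketched in a few lines of Python: with both listings sorted by key (the order ListObjectsV2 returns), a single linear pass classifies every object, which is where the O(n+m) complexity comes from. This is an illustrative sketch, not the tool's actual code:

```python
def merge_diff(src, dst):
    """Single pass over two key-sorted listings; O(n+m).

    src/dst: lists of (key, etag, size) tuples sorted by key.
    Returns (only_in_src, only_in_dst, mismatched_keys).
    """
    i = j = 0
    only_src, only_dst, mismatch = [], [], []
    while i < len(src) and j < len(dst):
        s, d = src[i], dst[j]
        if s[0] < d[0]:
            only_src.append(s[0]); i += 1      # key missing from destination
        elif s[0] > d[0]:
            only_dst.append(d[0]); j += 1      # key missing from source
        else:
            if s[1:] != d[1:]:                 # ETag or size differs
                mismatch.append(s[0])
            i += 1; j += 1
    only_src += [s[0] for s in src[i:]]        # drain whichever side remains
    only_dst += [d[0] for d in dst[j:]]
    return only_src, only_dst, mismatch

src = [("a.txt", "e1", 10), ("b.txt", "e2", 20), ("c.txt", "e3", 30)]
dst = [("a.txt", "e1", 10), ("c.txt", "eX", 30), ("d.txt", "e4", 40)]
print(merge_diff(src, dst))  # (['b.txt'], ['d.txt'], ['c.txt'])
```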

Features

  • Purpose-Built: Specifically designed for bucket comparison, not general file transfer
  • Deep Analysis: Goes beyond basic file comparison to examine S3-specific metadata
  • Scalable: Handles enterprise-scale buckets with millions of objects efficiently
  • Configurable: Extensive tuning options for different hardware and network conditions
  • Resumable: Checkpoint-based resumability with automatic progress saving to /tmp/diff_checker_{src_bucket}_{dst_bucket}.checkpoint. Optimal for long-running comparisons with millions of objects.

Core Functionality

Shallow Check Mode (Default)

  • Fast Object Enumeration: Quickly identifies missing objects between buckets
  • Basic Metadata Comparison: Compares ETag and size
  • Optimal for: Initial migration validation, quick consistency checks
  • S3 APIs used: ListObjectsV2 (on both source and destination buckets)

Deep Check Mode

  • Comprehensive Metadata Analysis: Examines object tags, custom metadata, and WORM settings
  • Advanced Comparison: Includes ObjectLock configuration and retention mode
  • Optimal for: Compliance auditing, detailed migration verification
  • S3 APIs used: ListObjectsV2 (on both buckets) + HeadObject and GetObjectTagging (per object, on both buckets)
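Conceptually, a deep check reduces to a field-by-field comparison of the per-object attributes gathered from HeadObject and GetObjectTagging. The record shape below is hypothetical and only illustrates the idea:

```python
# Fields a deep check examines per object (hypothetical record shape; the
# real tool builds these from HeadObject / GetObjectTagging responses).
DEEP_FIELDS = ["ETag", "Size", "Tags", "Metadata",
               "ObjectLockMode", "ObjectLockRetainUntilDate"]

def deep_diff(src_obj, dst_obj):
    """Return human-readable difference labels for one object key."""
    diffs = []
    for field in DEEP_FIELDS:
        if src_obj.get(field) != dst_obj.get(field):
            diffs.append(f"{field} differs")
    return diffs

a = {"ETag": "e1", "Size": 10, "Tags": {"env": "prod"}, "ObjectLockMode": "COMPLIANCE"}
b = {"ETag": "e1", "Size": 10, "Tags": {"env": "dev"},  "ObjectLockMode": None}
print(deep_diff(a, b))  # ['Tags differs', 'ObjectLockMode differs']
```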

Performance Features

Multi-Processing Architecture

```mermaid
graph LR
    subgraph "Producer"
        MAIN["Main Process<br/>List & Compare"]
    end

    subgraph "Consumer Pool"
        W1["Worker 1<br/>Deep Check"]
        W2["Worker 2<br/>Deep Check"]
        WN["Worker N<br/>Deep Check"]
    end

    subgraph "Reporting"
        RQ[Report Queue]
        REP["Report Process<br/>Output Generation"]
    end

    MAIN -->|Work Queue| W1
    MAIN -->|Work Queue| W2
    MAIN -->|Work Queue| WN

    W1 --> RQ
    W2 --> RQ
    WN --> RQ
    RQ --> REP

    style MAIN fill:#e3f2fd
    style W1 fill:#f1f8e9
    style W2 fill:#f1f8e9
    style WN fill:#f1f8e9
    style RQ fill:#fff3e0
    style REP fill:#fff3e0
```
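The producer/consumer pattern above can be sketched with standard-library queues. For portability this sketch uses threads; the tool itself uses worker processes, so deep checks are not serialized by Python's GIL:

```python
import queue
import threading

def run_pipeline(keys, num_workers=4, max_queue_size=100):
    """Producer/consumer sketch of the pipeline above (illustrative only)."""
    work_q = queue.Queue(maxsize=max_queue_size)  # bounded: caps memory use
    report_q = queue.Queue()

    def worker():
        while True:
            key = work_q.get()
            if key is None:                 # sentinel: shut down this worker
                break
            report_q.put((key, "checked"))  # stand-in for a deep check

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for k in keys:                          # producer: main thread feeds work
        work_q.put(k)
    for _ in workers:                       # one sentinel per worker
        work_q.put(None)
    for w in workers:
        w.join()
    return dict(report_q.get() for _ in keys)

print(run_pipeline([f"obj-{i}" for i in range(5)], num_workers=3))
```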

Advanced Configuration

  • Configurable Worker Processes: Scale processing based on available CPU cores
  • Connection Pool Management: Optimize network utilization with configurable connection limits
  • Memory Management: Queue size limits prevent memory exhaustion on large datasets
  • Progress Reporting: Real-time throughput metrics and completion estimates

Enterprise Features

Comprehensive Reporting

  • Structured Output: Machine-readable difference reports
  • Performance Metrics: Execution time, throughput, and processing statistics
  • Detailed Logging: Configurable log levels for debugging and monitoring

Security & Compliance

  • Secure Credential Handling: Support for access key and secret key authentication
  • Audit Trail: Detailed logs suitable for compliance reporting
  • Error Handling: Robust error recovery and reporting mechanisms

Quick Start

Prerequisites

  • Python 3.7+ with pip
  • Network Access to both source and destination S3 endpoints
  • Credentials with read access to both buckets

Installation

To use the bucket_delta tool, follow these steps:

  1. Download and install pip on the Linux VM:

    curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" && sudo python3 get-pip.py
  2. Install prerequisites using:

    sudo pip3 install -r requirements.txt
  3. Run the tool:

    Option A: Run directly from source:

    python3 bucket_delta.py --help

    Option B: Build a standalone executable:

    # Install PyInstaller
    pip3 install pyinstaller

    # Build a single-file binary
    python3 -m PyInstaller --onefile bucket_delta.py --name bucket_delta --hidden-import=ConfigParser

Basic Usage

# Simple bucket comparison
python3 bucket_delta.py \
  --source_bucket_name my-source-bucket \
  --source_endpoint_url https://s3.amazonaws.com \
  --source_access_key AKIA... \
  --source_secret_key secret... \
  --destination_bucket_name my-dest-bucket \
  --destination_endpoint_url https://s3.amazonaws.com \
  --destination_access_key AKIA... \
  --destination_secret_key secret... \
  --output_file differences.txt

Advanced Usage Examples

Deep Metadata Comparison

python3 bucket_delta.py \
  --source_bucket_name production-data \
  --source_endpoint_url https://s3.us-west-2.amazonaws.com \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_bucket_name backup-data \
  --destination_endpoint_url https://backup.company.com \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  --output_file detailed_differences.txt \
  --shallow_check false \
  --compare_object_tags true \
  --num_processes 20 \
  --num_connections 10 \
  --log_level DEBUG

Resuming After Failure

# If a previous run was interrupted, resume from checkpoint
python3 bucket_delta.py \
  --source_bucket_name production-data \
  --source_endpoint_url https://s3.us-west-2.amazonaws.com \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_bucket_name backup-data \
  --destination_endpoint_url https://backup.company.com \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  --output_file detailed_differences.txt \
  --resume

The tool automatically saves checkpoints every --num_objects_report objects (default: 1000) to /tmp/diff_checker_{source_bucket}_{dest_bucket}.checkpoint. On successful completion, the checkpoint file is automatically deleted.
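The resume flow can be illustrated with a minimal save/load pair. The JSON layout here is hypothetical; the actual checkpoint format is internal to the tool, and only the path convention is documented:

```python
import json
import os
import tempfile

# Hypothetical checkpoint layout, for illustration only.
def save_checkpoint(path, last_key, processed):
    with open(path, "w") as f:
        json.dump({"last_key": last_key, "processed": processed}, f)

def load_checkpoint(path):
    if not os.path.exists(path):
        return None                      # no checkpoint: fresh run
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.gettempdir(), "diff_checker_src_dst.checkpoint")
save_checkpoint(ckpt, "photos/2024/0999.jpg", 1000)   # every N objects
resume = load_checkpoint(ckpt)                        # on --resume
print(resume["processed"])  # 1000
os.remove(ckpt)                                       # on successful completion
```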

Configuration Reference

Required Parameters

| Parameter | Description | Example |
|---|---|---|
| --source_bucket_name | Name of the source bucket | my-source-bucket |
| --source_endpoint_url | S3 endpoint URL for source | https://s3.amazonaws.com |
| --source_access_key | Access key for source bucket | AKIAIOSFODNN7EXAMPLE |
| --source_secret_key | Secret key for source bucket | wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY |
| --destination_bucket_name | Name of the destination bucket | my-dest-bucket |
| --destination_endpoint_url | S3 endpoint URL for destination | https://s3.amazonaws.com |
| --destination_access_key | Access key for destination bucket | AKIAIOSFODNN7EXAMPLE |
| --destination_secret_key | Secret key for destination bucket | wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY |
| --output_file | Path to output differences file | ./differences.txt |

Optional Parameters

Comparison Mode

| Parameter | Type | Default | Description |
|---|---|---|---|
| --shallow_check | boolean | true | Enable shallow check mode (basic metadata only) |
| --compare_object_tags | boolean | true | Compare object tags during deep checks |

Resumability

| Parameter | Type | Default | Description |
|---|---|---|---|
| --resume | flag | false | Resume from last checkpoint if available |

Performance Tuning

| Parameter | Type | Default | Description |
|---|---|---|---|
| --num_processes | integer | 10 | Number of worker processes for parallel processing |
| --num_connections | integer | 10 | Number of HTTP connections per S3 client |
| --max_queue_size | integer | 100 | Maximum items in work queue (memory management) |
| --num_objects_report | integer | 1000 | Progress reporting interval (objects processed) |

Note: Each worker process creates its own S3 client with num_connections connections. The total number of concurrent HTTP connections to each endpoint is num_processes × num_connections (e.g., 10 processes × 10 connections = 100 connections per endpoint).
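As a quick sanity check, the connection math from the note above can be made explicit (a trivial illustrative helper, not part of the tool):

```python
def connections_per_endpoint(num_processes, num_connections):
    """Total concurrent HTTP connections each S3 endpoint will see."""
    return num_processes * num_connections

print(connections_per_endpoint(10, 10))  # 100 (the defaults)
print(connections_per_endpoint(20, 50))  # 1000: likely to trigger throttling
```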

Logging

| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
| --log_level | string | INFO | DEBUG, INFO, WARN, ERROR | Logging verbosity level |

Output Format

Difference Report Structure

The tool generates a structured text file with the following format:

Only in source bucket my-source-bucket, Differences: ['Key: file1.txt', 'ETag: abc123', 'Size: 1024 bytes']
Only in destination bucket my-dest-bucket, Differences: ['Key: file2.txt', 'ETag: def456', 'Size: 2048 bytes']
Difference in Key: file3.txt, Differences: ['ETag differs', 'Size differs']
Difference in Key: file4.txt, Differences: ['Tags differ', 'ObjectLockMode differs']
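Because the line prefixes are fixed, the report is easy to post-process. For example, a small (hypothetical) summarizer that tallies differences by category:

```python
def summarize_report(lines):
    """Tally a differences report by category, keyed off the fixed
    line prefixes shown above."""
    summary = {"only_in_source": 0, "only_in_destination": 0, "differs": 0}
    for line in lines:
        if line.startswith("Only in source bucket"):
            summary["only_in_source"] += 1
        elif line.startswith("Only in destination bucket"):
            summary["only_in_destination"] += 1
        elif line.startswith("Difference in Key:"):
            summary["differs"] += 1
    return summary

report = [
    "Only in source bucket my-source-bucket, Differences: ['Key: file1.txt']",
    "Difference in Key: file3.txt, Differences: ['ETag differs']",
    "Difference in Key: file4.txt, Differences: ['Tags differ']",
]
print(summarize_report(report))  # {'only_in_source': 1, 'only_in_destination': 0, 'differs': 2}
```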

Log Output Example

[2025-01-15 10:30:15 - INFO] - Validating input arguments
[2025-01-15 10:30:15 - INFO] - Input arguments validated successfully
[2025-01-15 10:30:15 - INFO] - Log level set to INFO
[2025-01-15 10:30:16 - INFO] - Listing all objects from source bucket: my-source-bucket
[2025-01-15 10:30:16 - INFO] - Listing all objects from destination bucket: my-dest-bucket
[2025-01-15 10:30:16 - INFO] - Starting bucket comparison
[2025-01-15 10:30:16 - INFO] - Mode: latest objects only
[2025-01-15 10:30:25 - INFO] - Processed 1000 objects in 8.45 seconds, total elapsed 8.45 seconds
[2025-01-15 10:30:35 - INFO] - Processed 2000 objects in 9.12 seconds, total elapsed 17.57 seconds
[2025-01-15 10:30:42 - INFO] - ------Final report------
[2025-01-15 10:30:42 - INFO] - Total objects processed: 2543
[2025-01-15 10:30:42 - INFO] - Total objects with differences: 15
[2025-01-15 10:30:42 - INFO] - Total execution time: 25.83 seconds
[2025-01-15 10:30:42 - INFO] - Throughput: 98.45 objects/second
[2025-01-15 10:30:42 - INFO] - ------End of report------

Advanced Usage

Environment Variables

For enhanced security, credentials can be provided via environment variables:

export SRC_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export SRC_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export DST_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export DST_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

python3 bucket_delta.py \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  # ... other parameters

Batch Processing Script

#!/bin/bash
# batch_compare.sh - Compare multiple bucket pairs

BUCKET_PAIRS=(
  "prod-data:backup-data"
  "user-uploads:user-backup"
  "analytics:analytics-backup"
)

for pair in "${BUCKET_PAIRS[@]}"; do
  IFS=':' read -r src dst <<< "$pair"
  echo "Comparing $src -> $dst"

  python3 bucket_delta.py \
    --source_bucket_name "$src" \
    --destination_bucket_name "$dst" \
    --source_endpoint_url https://s3.amazonaws.com \
    --destination_endpoint_url https://backup.company.com \
    --source_access_key $SRC_ACCESS_KEY \
    --source_secret_key $SRC_SECRET_KEY \
    --destination_access_key $DST_ACCESS_KEY \
    --destination_secret_key $DST_SECRET_KEY \
    --output_file "differences_${src}_${dst}.txt" \
    --log_level INFO
done

Integration with CI/CD

# .github/workflows/bucket-validation.yml
name: Bucket Validation
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  validate-backups:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Validate Production Backup
        run: |
          python3 bucket_delta.py \
            --source_bucket_name production-data \
            --destination_bucket_name backup-data \
            --source_endpoint_url ${{ secrets.PROD_S3_ENDPOINT }} \
            --destination_endpoint_url ${{ secrets.BACKUP_S3_ENDPOINT }} \
            --source_access_key ${{ secrets.PROD_ACCESS_KEY }} \
            --source_secret_key ${{ secrets.PROD_SECRET_KEY }} \
            --destination_access_key ${{ secrets.BACKUP_ACCESS_KEY }} \
            --destination_secret_key ${{ secrets.BACKUP_SECRET_KEY }} \
            --output_file backup_validation.txt \
            --shallow_check false

      - name: Upload Results
        uses: actions/upload-artifact@v3
        with:
          name: validation-results
          path: backup_validation.txt

Troubleshooting

Common Issues

High Memory Usage

Symptom: Process consumes excessive memory
Solution: Reduce --max_queue_size and --num_processes

--max_queue_size 25 --num_processes 5

Slow Performance

Symptom: Low throughput (< 10 objects/second)
Solutions:

  1. Increase concurrency: --num_processes 20 --num_connections 50
  2. Use shallow check for initial validation: --shallow_check true
  3. Reduce reporting frequency: --num_objects_report 5000

Connection Timeouts

Symptom: Frequent network errors
Solutions:

  1. Reduce connection count: --num_connections 10
  2. Check network connectivity to S3 endpoints

Permission Errors

Symptom: Access denied errors
Solutions:

  1. Verify bucket permissions: s3:ListBucket, s3:GetObject, s3:GetObjectTagging
  2. Check endpoint URLs are correct
  3. Validate access keys have appropriate permissions

Resume Not Working

Symptom: --resume flag doesn't continue from checkpoint
Solutions:

  1. Verify checkpoint file exists: ls /tmp/diff_checker_{source}_{dest}.checkpoint
  2. Ensure bucket names match exactly
  3. Check checkpoint file contents: cat /tmp/diff_checker_{source}_{dest}.checkpoint
  4. For fresh start, run without --resume

Debug Mode

Enable detailed logging for troubleshooting:

./bucket_delta \
  --log_level DEBUG \
  # ... other parameters

This provides detailed information about:

  • API calls and responses
  • Object processing details
  • Queue status and worker activity
  • Performance metrics

Performance Characteristics

Benchmarks below were run with 500K objects (45 KB each), --num_connections 10, on the following test environment:

| Resource | Specification |
|---|---|
| OS | Linux (Enterprise Linux 8) |
| CPU | 8 cores (x86_64) |
| RAM | 16 GB |
| Swap | 2 GB |
| Average latency | 3.971 ms (host to Object Store endpoint) |

Shallow Check

Shallow check runs as a single-process merge-diff; --num_processes does not affect its performance.

| Execution Time | Throughput | Peak CPU | Peak RAM |
|---|---|---|---|
| 153.18s | 3,917 obj/s | ~14.1% | ~1.31 GB |

Deep Check — Scaling with --num_processes

| Processes | Execution Time | Throughput | Peak CPU | Peak RAM |
|---|---|---|---|---|
| 10 | 699.14s | 858 obj/s | ~53.4% | ~4.45 GB |
| 15 | 518.77s | 1,156.58 obj/s | ~76.5% | ~4.56 GB |
| 20 | 468.93s | 1,280 obj/s | ~91.7% | ~4.68 GB |
| 25 | 445.98s | 1,345 obj/s | ~84.2% | ~4.87 GB |
| 50 | 478.60s | 1,254 obj/s | ~96.2% | ~5.42 GB |
| 75 | 498.30s | 1,204 obj/s | ~93.8% | ~6.00 GB |

Peak throughput is at ~25 processes. Beyond that, throughput decreases while CPU and memory usage continue to rise due to API rate limiting and process overhead.

Scaling Recommendations

  • Shallow check: Single-process; --num_processes has no effect. Performance is bounded by ListObjectsV2 pagination and endpoint latency.
  • Deep check: Start with 20–25 processes for best throughput-to-resource ratio. Going higher increases memory and CPU without improving throughput.
  • Memory-constrained environments: Reduce --num_processes and --max_queue_size (e.g., --num_processes 5 --max_queue_size 25).
  • Connection planning: Total connections per endpoint = num_processes × num_connections. Keep this within endpoint rate limits — high values (e.g., 20 x 50 = 1000) may cause throttling.
  • For very large buckets, run shallow check first, then deep check when detailed metadata validation is required.

License

This project is licensed under the Apache License, Version 2.0.

Contact & Support

Nutanix Team

Community Support

