A high-performance, enterprise-grade tool for identifying differences between S3-compatible buckets. Built by Nutanix for production workloads requiring reliable bucket synchronization verification, migration validation, and compliance auditing.
The Bucket Delta Diff Checker is a Python-based utility designed to efficiently compare two S3-compatible buckets and identify discrepancies. Whether you're validating data migrations, ensuring backup integrity, or maintaining compliance across multi-cloud environments, this tool provides comprehensive bucket analysis with enterprise-grade performance and reliability.
- Comprehensive Comparison: Detect missing objects and metadata differences
- High Performance: Multi-process architecture with configurable concurrency for optimal throughput
- Enterprise Ready: Built for production workloads with robust error handling and detailed logging
- Detailed Reporting: Comprehensive difference reports with customizable output formats
- Resumability: Checkpoint-based recovery allows resuming interrupted comparisons
```mermaid
graph TB
    subgraph "Input Sources"
        SRC["Source Bucket<br/>S3 Compatible"]
        DST["Destination Bucket<br/>S3 Compatible"]
    end
    subgraph "Bucket Delta Diff Checker"
        subgraph "Main Process"
            LST1["List Objects API<br/>Source"]
            LST2["List Objects API<br/>Destination"]
            MERGE["Merge Sort Algorithm<br/>O(n+m) Complexity"]
            SHALLOW["Shallow Check<br/>Basic Metadata"]
        end
        subgraph "Deep Check Pipeline"
            WQ["Work Queue<br/>Configurable Size"]
            WP1[Worker Process 1]
            WP2[Worker Process 2]
            WPN[Worker Process N]
            DEEP["Deep Check<br/>Tags, WORM, Custom Metadata"]
        end
        subgraph "Reporting"
            RQ[Report Queue]
            RP[Report Process]
            CP["Checkpoint File<br/>/tmp/diff_checker_*.checkpoint"]
        end
    end
    subgraph "Output"
        RPT["Difference Report<br/>Structured Output"]
        LOG["Execution Logs<br/>Configurable Levels"]
        STATS["Performance Statistics<br/>Throughput Metrics"]
    end
    SRC --> LST1
    DST --> LST2
    LST1 --> MERGE
    LST2 --> MERGE
    MERGE --> SHALLOW
    SHALLOW --> RQ
    SHALLOW --> WQ
    WQ --> WP1
    WQ --> WP2
    WQ --> WPN
    WP1 --> DEEP
    WP2 --> DEEP
    WPN --> DEEP
    DEEP --> RQ
    RQ --> RP
    RP --> RPT
    RP --> CP
    MERGE --> LOG
    MERGE --> STATS
    style SRC fill:#e1f5fe
    style DST fill:#e1f5fe
    style CP fill:#ffe0b2
    style MERGE fill:#fff3e0
    style SHALLOW fill:#f3e5f5
    style DEEP fill:#e8f5e8
    style RPT fill:#fff8e1
```
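Because `ListObjectsV2` returns keys in sorted order from both buckets, the merge step in the diagram above can find all deltas in a single linear pass. A minimal illustrative sketch of the idea (not the tool's actual code):

```python
# Illustrative O(n+m) merge-diff over two sorted object listings.
# Each listing is a list of (key, etag, size) tuples sorted by key.
def merge_diff(src, dst):
    diffs = []
    i = j = 0
    while i < len(src) and j < len(dst):
        s, d = src[i], dst[j]
        if s[0] < d[0]:                      # key exists only in source
            diffs.append(("only_in_source", s[0]))
            i += 1
        elif s[0] > d[0]:                    # key exists only in destination
            diffs.append(("only_in_destination", d[0]))
            j += 1
        else:                                # same key: compare basic metadata
            if s[1:] != d[1:]:
                diffs.append(("metadata_differs", s[0]))
            i += 1
            j += 1
    # whichever listing is longer contributes its remaining keys
    diffs += [("only_in_source", k) for k, *_ in src[i:]]
    diffs += [("only_in_destination", k) for k, *_ in dst[j:]]
    return diffs
```

Each key is visited at most once, which is why the shallow check scales linearly with the combined size of the two listings.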
- Purpose-Built: Specifically designed for bucket comparison, not general file transfer
- Deep Analysis: Goes beyond basic file comparison to examine S3-specific metadata
- Scalable: Handles enterprise-scale buckets with millions of objects efficiently
- Configurable: Extensive tuning options for different hardware and network conditions
- Resumable: Checkpoint-based resumability with automatic progress saving to `/tmp/diff_checker_{src_bucket}_{dst_bucket}.checkpoint`, ideal for long-running comparisons with millions of objects.
- Fast Object Enumeration: Quickly identifies missing objects between buckets
- Basic Metadata Comparison: Compares ETag and size
- Optimal for: Initial migration validation, quick consistency checks
- S3 APIs used: `ListObjectsV2` (on both source and destination buckets)
- Comprehensive Metadata Analysis: Examines object tags, custom metadata, and WORM settings
- Advanced Comparison: Includes ObjectLock configuration and retention mode
- Optimal for: Compliance auditing, detailed migration verification
- S3 APIs used: `ListObjectsV2` (on both buckets) + `HeadObject` and `GetObjectTagging` (per object, on both buckets)
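To illustrate the per-object deep-check calls listed above, here is a minimal sketch for a boto3 S3 client; the function name `fetch_deep_metadata` and the returned fields are illustrative assumptions, not the tool's actual code:

```python
# Illustrative sketch of the deep-check API calls per object.
# client: a boto3 S3 client, e.g. boto3.client('s3', endpoint_url=...).
def fetch_deep_metadata(client, bucket, key):
    head = client.head_object(Bucket=bucket, Key=key)
    tags = client.get_object_tagging(Bucket=bucket, Key=key)
    return {
        "metadata": head.get("Metadata", {}),               # custom x-amz-meta-* headers
        "lock_mode": head.get("ObjectLockMode"),            # WORM retention mode
        "retain_until": head.get("ObjectLockRetainUntilDate"),
        "tags": {t["Key"]: t["Value"] for t in tags["TagSet"]},
    }
```

Comparing the dictionaries returned for the source and destination copies of an object then yields the tag, WORM, and custom-metadata differences the deep check reports.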
```mermaid
graph LR
    subgraph "Producer"
        MAIN["Main Process<br/>List & Compare"]
    end
    subgraph "Consumer Pool"
        W1["Worker 1<br/>Deep Check"]
        W2["Worker 2<br/>Deep Check"]
        WN["Worker N<br/>Deep Check"]
    end
    subgraph "Reporting"
        RQ[Report Queue]
        REP["Report Process<br/>Output Generation"]
    end
    MAIN -->|Work Queue| W1
    MAIN -->|Work Queue| W2
    MAIN -->|Work Queue| WN
    W1 --> RQ
    W2 --> RQ
    WN --> RQ
    RQ --> REP
    style MAIN fill:#e3f2fd
    style W1 fill:#f1f8e9
    style W2 fill:#f1f8e9
    style WN fill:#f1f8e9
    style RQ fill:#fff3e0
    style REP fill:#fff3e0
```
- Configurable Worker Processes: Scale processing based on available CPU cores
- Connection Pool Management: Optimize network utilization with configurable connection limits
- Memory Management: Queue size limits prevent memory exhaustion on large datasets
- Progress Reporting: Real-time throughput metrics and completion estimates
- Structured Output: Machine-readable difference reports
- Performance Metrics: Execution time, throughput, and processing statistics
- Detailed Logging: Configurable log levels for debugging and monitoring
- Secure Credential Handling: Support for access key and secret key authentication
- Audit Trail: Detailed logs suitable for compliance reporting
- Error Handling: Robust error recovery and reporting mechanisms
- Python 3.7+ with pip
- Network Access to both source and destination S3 endpoints
- Credentials with read access to both buckets
To use the bucket_delta tool, follow these steps:

1. Download and install pip on the Linux VM:

   ```bash
   curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" && sudo python3 get-pip.py
   ```

2. Install the prerequisites:

   ```bash
   sudo pip3 install -r requirements.txt
   ```

3. Run the tool.

   Option A: Run directly from source:

   ```bash
   python3 bucket_delta.py --help
   ```

   Option B: Build a standalone executable with PyInstaller:

   ```bash
   # Install PyInstaller
   pip3 install pyinstaller
   # Build a single-file executable
   python3 -m PyInstaller --onefile bucket_delta.py --name bucket_delta --hidden-import=ConfigParser
   ```
```bash
# Simple bucket comparison
python3 bucket_delta.py \
  --source_bucket_name my-source-bucket \
  --source_endpoint_url https://s3.amazonaws.com \
  --source_access_key AKIA... \
  --source_secret_key secret... \
  --destination_bucket_name my-dest-bucket \
  --destination_endpoint_url https://s3.amazonaws.com \
  --destination_access_key AKIA... \
  --destination_secret_key secret... \
  --output_file differences.txt
```

```bash
# Deep check with performance tuning
python3 bucket_delta.py \
  --source_bucket_name production-data \
  --source_endpoint_url https://s3.us-west-2.amazonaws.com \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_bucket_name backup-data \
  --destination_endpoint_url https://backup.company.com \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  --output_file detailed_differences.txt \
  --shallow_check false \
  --compare_object_tags true \
  --num_processes 20 \
  --num_connections 10 \
  --log_level DEBUG
```

```bash
# If a previous run was interrupted, resume from checkpoint
python3 bucket_delta.py \
  --source_bucket_name production-data \
  --source_endpoint_url https://s3.us-west-2.amazonaws.com \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_bucket_name backup-data \
  --destination_endpoint_url https://backup.company.com \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  --output_file detailed_differences.txt \
  --resume
```

The tool automatically saves checkpoints every `--num_objects_report` objects (default: 1000) to `/tmp/diff_checker_{source_bucket}_{dest_bucket}.checkpoint`. On successful completion, the checkpoint file is automatically deleted.
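For wrapper scripts, a small illustrative helper (not part of the tool) can decide whether to pass `--resume` based on whether a checkpoint file from an interrupted run is present:

```python
# Illustrative helper for wrapper scripts: add --resume only when a
# checkpoint from an interrupted run exists. Not part of the tool itself.
import os

def checkpoint_path(src_bucket, dst_bucket):
    # Path format as documented above.
    return f"/tmp/diff_checker_{src_bucket}_{dst_bucket}.checkpoint"

def build_resume_flag(src_bucket, dst_bucket):
    # Returns extra CLI arguments to append to the bucket_delta command line.
    if os.path.exists(checkpoint_path(src_bucket, dst_bucket)):
        return ["--resume"]
    return []
```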
| Parameter | Description | Example |
|---|---|---|
| `--source_bucket_name` | Name of the source bucket | `my-source-bucket` |
| `--source_endpoint_url` | S3 endpoint URL for source | `https://s3.amazonaws.com` |
| `--source_access_key` | Access key for source bucket | `AKIAIOSFODNN7EXAMPLE` |
| `--source_secret_key` | Secret key for source bucket | `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` |
| `--destination_bucket_name` | Name of the destination bucket | `my-dest-bucket` |
| `--destination_endpoint_url` | S3 endpoint URL for destination | `https://s3.amazonaws.com` |
| `--destination_access_key` | Access key for destination bucket | `AKIAIOSFODNN7EXAMPLE` |
| `--destination_secret_key` | Secret key for destination bucket | `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` |
| `--output_file` | Path to output differences file | `./differences.txt` |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--shallow_check` | boolean | `true` | Enable shallow check mode (basic metadata only) |
| `--compare_object_tags` | boolean | `true` | Compare object tags during deep checks |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--resume` | flag | `false` | Resume from last checkpoint if available |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--num_processes` | integer | `10` | Number of worker processes for parallel processing |
| `--num_connections` | integer | `10` | Number of HTTP connections per S3 client |
| `--max_queue_size` | integer | `100` | Maximum items in work queue (memory management) |
| `--num_objects_report` | integer | `1000` | Progress reporting interval (objects processed) |
Note: Each worker process creates its own S3 client with `num_connections` connections. The total number of concurrent HTTP connections to each endpoint is `num_processes × num_connections` (e.g., 10 processes × 10 connections = 100 connections per endpoint).
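The sizing rule above can be captured in a tiny planning helper; `endpoint_limit` below is a made-up example value, not a documented limit of any particular object store:

```python
# Illustrative connection-planning helper. endpoint_limit is an assumed
# example value; check your object store's documentation for real limits.
def plan_connections(num_processes: int, num_connections: int,
                     endpoint_limit: int = 500) -> dict:
    total = num_processes * num_connections   # connections per endpoint
    return {"total": total, "within_limit": total <= endpoint_limit}
```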
| Parameter | Type | Default | Options | Description |
|---|---|---|---|---|
| `--log_level` | string | `INFO` | `DEBUG`, `INFO`, `WARN`, `ERROR` | Logging verbosity level |
The tool generates a structured text file with the following format:

```
Only in source bucket my-source-bucket, Differences: ['Key: file1.txt', 'ETag: abc123', 'Size: 1024 bytes']
Only in destination bucket my-dest-bucket, Differences: ['Key: file2.txt', 'ETag: def456', 'Size: 2048 bytes']
Difference in Key: file3.txt, Differences: ['ETag differs', 'Size differs']
Difference in Key: file4.txt, Differences: ['Tags differ', 'ObjectLockMode differs']
```
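Since the report is line-oriented, it is straightforward to post-process, for example to feed differences into monitoring. A hedged parsing sketch based only on the sample lines above:

```python
# Illustrative parser for report lines; the format assumptions here are
# based solely on the sample output shown above.
import ast

def parse_report_line(line):
    prefix, _, diff_part = line.partition(", Differences: ")
    diffs = ast.literal_eval(diff_part)  # the list is printed in Python repr form
    if prefix.startswith("Only in source bucket "):
        kind = "only_in_source"
    elif prefix.startswith("Only in destination bucket "):
        kind = "only_in_destination"
    else:
        kind = "differs"
    return {"kind": kind, "detail": prefix, "differences": diffs}
```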
Sample console output during a run:

```
[2025-01-15 10:30:15 - INFO] - Validating input arguments
[2025-01-15 10:30:15 - INFO] - Input arguments validated successfully
[2025-01-15 10:30:15 - INFO] - Log level set to INFO
[2025-01-15 10:30:16 - INFO] - Listing all objects from source bucket: my-source-bucket
[2025-01-15 10:30:16 - INFO] - Listing all objects from destination bucket: my-dest-bucket
[2025-01-15 10:30:16 - INFO] - Starting bucket comparison
[2025-01-15 10:30:16 - INFO] - Mode: latest objects only
[2025-01-15 10:30:25 - INFO] - Processed 1000 objects in 8.45 seconds, total elapsed 8.45 seconds
[2025-01-15 10:30:35 - INFO] - Processed 2000 objects in 9.12 seconds, total elapsed 17.57 seconds
[2025-01-15 10:30:42 - INFO] - ------Final report------
[2025-01-15 10:30:42 - INFO] - Total objects processed: 2543
[2025-01-15 10:30:42 - INFO] - Total objects with differences: 15
[2025-01-15 10:30:42 - INFO] - Total execution time: 25.83 seconds
[2025-01-15 10:30:42 - INFO] - Throughput: 98.45 objects/second
[2025-01-15 10:30:42 - INFO] - ------End of report------
```
For enhanced security, credentials can be provided via environment variables:

```bash
export SRC_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export SRC_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export DST_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export DST_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

python3 bucket_delta.py \
  --source_access_key $SRC_ACCESS_KEY \
  --source_secret_key $SRC_SECRET_KEY \
  --destination_access_key $DST_ACCESS_KEY \
  --destination_secret_key $DST_SECRET_KEY \
  # ... other parameters
```

```bash
#!/bin/bash
# batch_compare.sh - Compare multiple bucket pairs
BUCKET_PAIRS=(
  "prod-data:backup-data"
  "user-uploads:user-backup"
  "analytics:analytics-backup"
)

for pair in "${BUCKET_PAIRS[@]}"; do
  IFS=':' read -r src dst <<< "$pair"
  echo "Comparing $src -> $dst"
  python3 bucket_delta.py \
    --source_bucket_name "$src" \
    --destination_bucket_name "$dst" \
    --source_endpoint_url https://s3.amazonaws.com \
    --destination_endpoint_url https://backup.company.com \
    --source_access_key $SRC_ACCESS_KEY \
    --source_secret_key $SRC_SECRET_KEY \
    --destination_access_key $DST_ACCESS_KEY \
    --destination_secret_key $DST_SECRET_KEY \
    --output_file "differences_${src}_${dst}.txt" \
    --log_level INFO
done
```

```yaml
# .github/workflows/bucket-validation.yml
name: Bucket Validation
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
jobs:
  validate-backups:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Validate Production Backup
        run: |
          python3 bucket_delta.py \
            --source_bucket_name production-data \
            --destination_bucket_name backup-data \
            --source_endpoint_url ${{ secrets.PROD_S3_ENDPOINT }} \
            --destination_endpoint_url ${{ secrets.BACKUP_S3_ENDPOINT }} \
            --source_access_key ${{ secrets.PROD_ACCESS_KEY }} \
            --source_secret_key ${{ secrets.PROD_SECRET_KEY }} \
            --destination_access_key ${{ secrets.BACKUP_ACCESS_KEY }} \
            --destination_secret_key ${{ secrets.BACKUP_SECRET_KEY }} \
            --output_file backup_validation.txt \
            --shallow_check false
      - name: Upload Results
        uses: actions/upload-artifact@v3
        with:
          name: validation-results
          path: backup_validation.txt
```

Symptom: Process consumes excessive memory
Solution: Reduce `--max_queue_size` and `--num_processes`, for example:

```bash
--max_queue_size 25 --num_processes 5
```

Symptom: Low throughput (< 10 objects/second)
Solutions:
- Increase concurrency: `--num_processes 20 --num_connections 50`
- Use shallow check for initial validation: `--shallow_check true`
- Reduce reporting frequency: `--num_objects_report 5000`
Symptom: Frequent network errors
Solutions:
- Reduce connection count: `--num_connections 10`
- Check network connectivity to S3 endpoints

Symptom: Access denied errors
Solutions:
- Verify bucket permissions: `s3:ListBucket`, `s3:GetObject`, `s3:GetObjectTagging`
- Check endpoint URLs are correct
- Validate access keys have appropriate permissions

Symptom: `--resume` flag doesn't continue from checkpoint
Solutions:
- Verify the checkpoint file exists: `ls /tmp/diff_checker_{source}_{dest}.checkpoint`
- Ensure bucket names match exactly
- Check checkpoint file contents: `cat /tmp/diff_checker_{source}_{dest}.checkpoint`
- For a fresh start, run without `--resume`
Enable detailed logging for troubleshooting:

```bash
./bucket_delta \
  --log_level DEBUG \
  # ... other parameters
```

This provides detailed information about:
- API calls and responses
- Object processing details
- Queue status and worker activity
- Performance metrics
Benchmarks below were run with 500K objects (45 KB each) and `--num_connections 10` on the following test environment:
| Resource | Specification |
|---|---|
| OS | Linux (Enterprise Linux 8) |
| CPU | 8 cores (x86_64) |
| RAM | 16 GB |
| Swap | 2 GB |
| Average latency | 3.971 ms (Host to Object Store endpoint) |
Shallow check runs as a single-process merge-diff; `--num_processes` does not affect its performance.
| Execution Time | Throughput | Peak CPU | Peak RAM |
|---|---|---|---|
| 153.18s | 3,917 obj/s | ~14.1% | ~1.31 GB |
| Processes | Execution Time | Throughput | Peak CPU | Peak RAM |
|---|---|---|---|---|
| 10 | 699.14s | 858 obj/s | ~53.4% | ~4.45 GB |
| 15 | 518.77s | 1,156.58 obj/s | ~76.5% | ~4.56 GB |
| 20 | 468.93s | 1,280 obj/s | ~91.7% | ~4.68 GB |
| 25 | 445.98s | 1,345 obj/s | ~84.2% | ~4.87 GB |
| 50 | 478.60s | 1,254 obj/s | ~96.2% | ~5.42 GB |
| 75 | 498.30s | 1,204 obj/s | ~93.8% | ~6.00 GB |
Peak throughput is at ~25 processes. Beyond that, throughput decreases while CPU and memory usage continue to rise due to API rate limiting and process overhead.
- Shallow check: Single-process; `--num_processes` has no effect. Performance is bounded by `ListObjectsV2` pagination and endpoint latency.
- Deep check: Start with 20–25 processes for the best throughput-to-resource ratio. Going higher increases memory and CPU without improving throughput.
- Memory-constrained environments: Reduce `--num_processes` and `--max_queue_size` (e.g., `--num_processes 5 --max_queue_size 25`).
- Connection planning: Total connections per endpoint = `num_processes × num_connections`. Keep this within endpoint rate limits; high values (e.g., 20 × 50 = 1000) may cause throttling.
- For very large buckets, run a shallow check first, then a deep check when detailed metadata validation is required.
This project is released under the Apache License 2.0.
- Engineering Team: s3-bucket-delta@nutanix.com
- GitHub Issues: Report bugs and request features