This is a simple Python script that verifies the md5sums of local files against the md5sums AWS calculates when those files are uploaded to S3.
A common research use case is to store files in an AWS S3 bucket. Often the same bucket is made accessible to an EC2 server via File Gateway, and/or SFTP clients are enabled via Transfer Family; both services present the S3 bucket as an NFS mount.
To make an S3 bucket look like a regular NFS mount, the File Gateway or Transfer Family server caches a lot of data. CACHING IS SLOW. Calculating a checksum on an NFS-mounted bucket like so:
$ md5sum /my-bucket/my-path/my-giant-dir/*
requires loading the contents of every file into the cache, just to calculate the checksums.
The md5sum is ALREADY calculated for you. The AWS S3 API calculates the md5sum when the file upload finishes on the AWS side and stores it in the "ETag" field of the S3 object. We can retrieve that very efficiently using the API. (Caveat: the ETag is a plain MD5 only for single-part uploads without SSE-KMS encryption; a multipart upload produces an ETag with a "-N" part-count suffix that will not match an md5sum.)
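For instance, here's a minimal sketch of the idea (the helper names are my own; it assumes boto3 is installed and your AWS credentials are configured):

```python
import hashlib


def local_md5(path, chunk_size=1 << 20):
    """Compute a local file's MD5 in chunks, without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def s3_etag(bucket, key):
    """Fetch the ETag S3 recorded at upload time -- no object download required."""
    import boto3  # only needed for the S3 call; uses your default credentials

    response = boto3.client("s3").head_object(Bucket=bucket, Key=key)
    return response["ETag"].strip('"')  # S3 returns the ETag wrapped in quotes
```

A file then verifies cleanly when local_md5(path) == s3_etag(bucket, key), with no bytes pulled through the NFS cache.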
$ git clone git@github.com:jcabraham/verify-s3-checksums.git
$ cd verify-s3-checksums
# install pipenv (if not done already)
$ pip install pipenv
# install everything in Pipfile
$ pipenv install
NOTE: boto3 is configured here to use the [default] profile in your ~/.aws/credentials. I leave parameterizing this as an exercise for the reader.
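If you'd rather not touch the code, boto3 also honors the AWS_PROFILE environment variable, so you can select a different profile per invocation (the profile name below is a placeholder):
$ AWS_PROFILE=my-other-profile python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/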
Do this however you wish, but I like using rclone for transfers: it's extremely efficient and easy to configure. Install it via brew (Mac), yum/apt (Linux), or the rclone website (Windows), then copy the example config below to:
~/.config/rclone/rclone.conf
[my_bucket]
type = sftp
host = [my sftp host]
user = [my username]
key_file = [path_to_my_private_key]
set_modtime = false
NOTE: you must include the
set_modtime = false
line or you'll get errors.
Then you can do things like the following:
$ rclone sync ./my-giant-dir/ my_bucket:/my-giant-dir
The provided verify-checksums.py Python script will create a dictionary of md5sums for your local files and compare them to the md5sums (ETags) of the same files in the S3 bucket.
$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
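The core of the comparison can be sketched roughly like this (function names are mine and the script's actual internals may differ; list_s3_etags assumes boto3):

```python
def list_s3_etags(bucket, prefix):
    """Map each key's basename to its ETag for every object under prefix."""
    import boto3  # only needed for the S3 call

    sums = {}
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            sums[name] = obj["ETag"].strip('"')
    return sums


def compare_checksums(local, remote, warn_missing=True):
    """Compare {name: md5} dicts; return (name, local_sum, remote_sum) mismatches."""
    errors = []
    for name, md5 in local.items():
        if name not in remote:
            if warn_missing:
                errors.append((name, md5, None))  # present locally, missing on S3
        elif remote[name] != md5:
            errors.append((name, md5, remote[name]))
    return errors
```

Note that listing ETags with a single paginated list_objects_v2 call touches no file contents, which is why this is so much faster than checksumming through the NFS cache.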
You can also use the saved output of md5sum as the local source:
$ cat ./local-checksums.txt
c157a79031e1c40f85931829bc5fc552 bar.txt
258622b1688250cb619f3c9ccaefb7eb baz.txt
d3b07384d113edec49eaa6238ad5ff00 foo.txt
$ python3 verify-checksums.py check-list --checksum-list local-checksums.txt --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
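The md5sum output format is simple to parse yourself if you want to script around the checksum list — a sketch (the helper name is my own, not part of verify-checksums.py):

```python
def parse_md5sum_list(lines):
    """Parse md5sum output lines ('<hex>  <filename>') into {filename: checksum}."""
    sums = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum prefixes filenames read in binary mode with '*'
        sums[name.lstrip("*")] = checksum
    return sums
```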
$ python3 verify-checksums.py --help
Usage: verify-checksums.py [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
check-directory Iterate over files in a directory, generate md5sum for...
check-list Read in a file list generated by md5sum and compare to...
$ python3 verify-checksums.py check-directory --help
Usage: verify-checksums.py check-directory [OPTIONS]
Iterate over files in a directory, generate md5sum for each and compare to
sum on s3
Options:
--local-path PATH Local directory path
--include TEXT Glob of filenames to include, e.g. "*.gz". Defaults to "*" (all)
--bucket TEXT Bucket name (e.g. "my-bucket", not "s3://my-bucket") [required]
--remote-path TEXT Bucket directory to check against, e.g. my-path/my-giant-dir [required]
--warn-missing Warn if a file is missing from the bucket.
--debug Turn on debugging.
--help Show this message and exit.
# Upload some test files with rclone
$ rclone sync ./my-giant-dir/ my_bucket:/my-giant-dir
# (no output: successful upload)
# Verify checksums
$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
Verifying ./my-giant-dir/* checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
Verified 3 files, 0 errors.
# Verify checksums, turn on debugging info
$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing --debug
Verifying ./my-giant-dir/* checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
Found credentials in shared credentials file: ~/.aws/credentials
OK baz.txt local: 258622b1688250cb619f3c9ccaefb7eb, s3: 258622b1688250cb619f3c9ccaefb7eb
OK foo.txt local: d3b07384d113edec49eaa6238ad5ff00, s3: d3b07384d113edec49eaa6238ad5ff00
OK bar.txt local: c157a79031e1c40f85931829bc5fc552, s3: c157a79031e1c40f85931829bc5fc552
Verified 3 files, 0 errors.
$ echo foo >> ./my-giant-dir/foo.txt
$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing --debug
Verifying ./my-giant-dir/* checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
ERROR foo.txt local: 5fb7ba7e8447a836e774b66155f5776a s3: d3b07384d113edec49eaa6238ad5ff00
Verified 3 files, 1 errors.
# run md5sum manually, save to file
$ cd ./my-giant-dir
$ md5sum * > ../local-checksums.txt
$ cd ..
$ cat ./local-checksums.txt
c157a79031e1c40f85931829bc5fc552 bar.txt
258622b1688250cb619f3c9ccaefb7eb baz.txt
d3b07384d113edec49eaa6238ad5ff00 foo.txt
# Verify info in file against s3
$ python3 verify-checksums.py check-list --checksum-list local-checksums.txt --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
Verifying <_io.TextIOWrapper name='local-checksums.txt' mode='r' encoding='UTF-8'> checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
ERROR foo.txt local: 5fb7ba7e8447a836e774b66155f5776a s3: d3b07384d113edec49eaa6238ad5ff00
Verified 3 files, 1 errors.