
Verify Checksums

This is a simple Python script that verifies the md5sum of local files against the md5sum calculated by AWS when those files are uploaded to S3.

The Bad News

A common research use case is storing files in an AWS S3 bucket. Often that same bucket is also exposed to an EC2 server as an NFS mount via File Gateway, and/or SFTP access to it is enabled via Transfer Family.

To make an S3 bucket look like a regular NFS mount, the File Gateway or Transfer Family server caches a lot of data locally. CACHING IS SLOW. Calculating a checksum on an NFS-mounted bucket like so:

$ md5sum /my-bucket/my-path/my-giant-dir/*

requires the cache to load the contents of every file, just to compute the checksums.
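To see why that's expensive, note that computing an MD5 means streaming every byte of every file, which is exactly the data the gateway has to fault into its cache. A minimal sketch of a local checksum pass (the helper name and chunk size are illustrative, not part of this script):

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file by streaming it in chunks.

    Every byte must be read, so on an NFS-mounted S3 bucket this
    forces the gateway to pull the entire object through its cache.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```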

The Good News

The md5sum is ALREADY calculated for you. When an upload completes, S3 computes a checksum on the AWS side and stores it in the "ETag" field of the S3Object, which we can retrieve very efficiently via the API. (One caveat: the ETag is the plain MD5 of the object only for single-part uploads without SSE-KMS encryption; multipart uploads get a hash-of-hashes with a part-count suffix, so files transferred in parts won't match a local md5sum.)
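A minimal sketch of that efficient retrieval, assuming boto3 and a hypothetical helper name: one paginated list_objects_v2 call returns the ETags for a whole prefix without downloading any object data.

```python
def s3_etags(bucket, prefix, s3=None):
    """Return {key: etag} for every object under a prefix.

    Only listing metadata is transferred; no object bodies are read.
    """
    if s3 is None:
        import boto3  # deferred so the helper is importable without boto3
        s3 = boto3.client("s3")  # uses the [default] profile
    etags = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # S3 returns ETags wrapped in double quotes; strip them
            etags[obj["Key"]] = obj["ETag"].strip('"')
    return etags
```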

Suggested Workflow

0. Install this thing

$ git clone git@github.com:jcabraham/verify-s3-checksums.git
$ cd verify-s3-checksums

# install pipenv (if not done already)
$ pip install pipenv
# install everything in Pipfile
$ pipenv install

NOTE: boto3 here is configured to use the [default] profile in your ~/.aws/credentials. I leave parameterizing this as an exercise for the reader.

1. Transfer your files

Do this however you wish, but I like using rclone for transfers: it's extremely efficient and easy to configure. Install it via brew (macOS), yum/apt (Linux), or the website (Windows), then copy the following example config to:

~/.config/rclone/rclone.conf:

[my_bucket]
type = sftp
host = [my sftp host]
user = [my username]
key_file = [path_to_my_private_key]
set_modtime = false

NOTE: you must include

set_modtime = false

or you'll get errors when rclone tries to set modification times after each upload.

Then you can do things like the following:

rclone sync ./my-giant-dir/ my_bucket:/my-giant-dir

2. Verify your upload checksums

The provided verify-checksums.py Python script will create a dictionary of md5sums for your local files and compare them to the md5sums (ETags) of the same files in the S3 bucket.

$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing

You can also use the saved output of md5sum as the local source:

$ cat ./local-checksums.txt
c157a79031e1c40f85931829bc5fc552  bar.txt
258622b1688250cb619f3c9ccaefb7eb  baz.txt
d3b07384d113edec49eaa6238ad5ff00  foo.txt

$ python3 verify-checksums.py check-list --checksum-list local-checksums.txt --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
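The md5sum output format consumed by check-list is easy to parse: each line is a 32-character hex digest, whitespace, then the filename (with a leading "*" in binary mode). A sketch of such a parser, with an illustrative function name rather than the script's actual code:

```python
def parse_md5sum_lines(lines):
    """Parse md5sum-style output lines into {filename: digest}."""
    sums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # digest and filename are separated by whitespace;
        # binary-mode entries prefix the filename with "*"
        digest, name = line.split(None, 1)
        sums[name.lstrip("*")] = digest
    return sums
```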

Examples

Getting General Help

$ python3 verify-checksums.py --help
Usage: verify-checksums.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  check-directory  Iterate over files in a directory, generate md5sum for...
  check-list       Read in a file list generated by md5sum and compare to...

Help on a specific command

$ python3 verify-checksums.py check-directory --help
Usage: verify-checksums.py check-directory [OPTIONS]

  Iterate over files in a directory, generate md5sum for each and compare to
  sum on s3

Options:
  --local-path PATH   Local directory path
  --include TEXT      Glob of filenames to include, e.g. "*.gz". Defaults to "*" (all)
  --bucket TEXT       Bucket name (e.g. "my-bucket", not "s3://my-bucket")  [required]
  --remote-path TEXT  Bucket directory to check against, e.g. my-path/my-giant-dir [required]
  --warn-missing      Warn if a file is missing from the bucket.
  --debug             Turn on debugging.
  --help              Show this message and exit.

Check a directory (everything correct)

# Upload some test files with rclone
$ rclone sync ./my-giant-dir/ my_bucket:/my-giant-dir

# (no output: successful upload)

# Verify checksums
$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
Verifying ./my-giant-dir/* checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
Verified 3 files, 0 errors.

# Verify checksums, turn on debugging info
$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing --debug
Verifying ./my-giant-dir/* checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
Found credentials in shared credentials file: ~/.aws/credentials
OK baz.txt local: 258622b1688250cb619f3c9ccaefb7eb, s3: 258622b1688250cb619f3c9ccaefb7eb
OK foo.txt local: d3b07384d113edec49eaa6238ad5ff00, s3: d3b07384d113edec49eaa6238ad5ff00
OK bar.txt local: c157a79031e1c40f85931829bc5fc552, s3: c157a79031e1c40f85931829bc5fc552
Verified 3 files, 0 errors.

Modify a local file, check directory again

$ echo foo >> ./my-giant-dir/foo.txt

$ python3 verify-checksums.py check-directory --local-path ./my-giant-dir --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing --debug
Verifying ./my-giant-dir/* checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
ERROR foo.txt local: 5fb7ba7e8447a836e774b66155f5776a s3: d3b07384d113edec49eaa6238ad5ff00
Verified 3 files, 1 errors.

Check against a file previously generated by md5sum or rclone

# run md5sum manually, save to file
$ cd ./my-giant-dir
$ md5sum * >> ../local-checksums.txt
$ cd ..
$ cat ./local-checksums.txt
c157a79031e1c40f85931829bc5fc552  bar.txt
258622b1688250cb619f3c9ccaefb7eb  baz.txt
d3b07384d113edec49eaa6238ad5ff00  foo.txt

# Verify info in file against s3
$ python3 verify-checksums.py check-list --checksum-list local-checksums.txt --bucket my-bucket --remote-path my-path/my-giant-dir/ --warn-missing
Verifying <_io.TextIOWrapper name='local-checksums.txt' mode='r' encoding='UTF-8'> checksums against s3://my-bucket/my-path/my-giant-dir/
Warn if files missing on s3: True
ERROR foo.txt local: 5fb7ba7e8447a836e774b66155f5776a s3: d3b07384d113edec49eaa6238ad5ff00
Verified 3 files, 1 errors.
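The comparison behind these reports can be sketched roughly as follows. This is a hypothetical stand-in for the script's own logic, taking local sums and S3 ETags as plain dicts keyed by filename:

```python
def compare_checksums(local, remote, warn_missing=True):
    """Compare {name: md5} against {name: etag}; return (total, errors).

    `local` maps filenames to local MD5 digests; `remote` maps the
    same filenames to S3 ETags (surrounding quotes already stripped).
    """
    errors = 0
    for name, digest in sorted(local.items()):
        etag = remote.get(name)
        if etag is None:
            if warn_missing:
                print(f"MISSING {name} not found in bucket")
                errors += 1
            continue
        if etag != digest:
            print(f"ERROR {name} local: {digest} s3: {etag}")
            errors += 1
    return len(local), errors
```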
