Bsweger/switch hubverse aws sync to rclone #13
Conversation
Resolves hubverse-org/hubverse-cloud#36. Use rclone for cloud syncing because it has a checksum option for detecting file differences (whereas `aws s3 sync` relies on modification date and file size, which doesn't work well in a CI environment). This changeset also ensures that model-outputs are synced to two places in the target S3 bucket: `model-outputs/` and `raw/model-outputs/`. Once the parquet conversion feature is implemented, `raw/model-outputs/` will reflect the data as it appears in a hub's repo, and `model-outputs/` will hold the "user-facing" parquet files.
```shell
rclone sync \
  "./$DIRECTORY/" \
  ":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY" \
  --checksum --verbose --stats-one-line --config=/dev/null
```
`--checksum` instructs rclone to use file size + hash/checksum when determining whether a file has changed (instead of file size + modification date).

`--config=/dev/null` is a hack to suppress the warning about a missing config file (we don't use one here, since we don't need any special setup options).

More info: https://rclone.org/commands/rclone_sync/#copy-options
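To see why checksum comparison matters in CI, here's a minimal sketch (the file name and temp directory are hypothetical, and `sha256sum` stands in for rclone's internal hashing): a fresh checkout resets every file's modification time, but the content hash is unchanged, so a checksum-based sync has nothing to upload.

```shell
# Why mtime-based comparison over-uploads in CI: a fresh checkout gives
# every file a new modification time, but the content hash is stable.
# (sha256sum stands in for rclone's hashing; paths are hypothetical)
workdir=$(mktemp -d)
echo "team-model forecast data" > "$workdir/model.csv"
sum_before=$(sha256sum "$workdir/model.csv" | cut -d' ' -f1)

touch "$workdir/model.csv"   # simulate CI checkout: new mtime, same bytes

sum_after=$(sha256sum "$workdir/model.csv" | cut -d' ' -f1)
if [ "$sum_before" = "$sum_after" ]; then
  echo "checksums match: nothing to sync"
fi
rm -rf "$workdir"
```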
```shell
then
  rclone sync \
    "./$DIRECTORY/" \
    ":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY" \
```
This line is an rclone S3 connection string. `env_auth` instructs rclone to get AWS credentials via environment variables. Those AWS environment variables are set in the prior *Configure AWS credentials* step.
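A minimal sketch of how the pieces fit together (the credential values and bucket name below are placeholders, not real values): the *Configure AWS credentials* step exports these variables, and the `:s3,provider=AWS,env_auth:` connection string tells rclone to pick them up instead of reading a config file.

```shell
# Placeholders standing in for what the Configure AWS credentials step
# exports; env_auth makes rclone read them from the environment.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"        # hypothetical
export AWS_SECRET_ACCESS_KEY="example-secret"    # hypothetical
export AWS_REGION="us-east-1"                    # hypothetical

BUCKET_NAME="example-hub-bucket"                 # hypothetical
DIRECTORY="model-output"

# Build the same connection string the workflow uses; --dry-run would
# preview the transfer without uploading anything.
remote=":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY"
echo "rclone sync ./$DIRECTORY/ $remote --checksum --dry-run"
```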
```shell
done
# unlike other data, model-outputs are synced to a "raw" location
```
Added an explicit step for model-outputs, so we can land those files in a separate location (because we want to do some transforms on them).
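The layout this produces can be sketched as follows (the directory list and bucket name are hypothetical): most data syncs to its own top-level prefix, while model-output lands under `raw/` so the files as originally submitted are preserved for later transforms.

```shell
# Hypothetical directory list and bucket name; model-output gets its own
# step so the as-submitted files land under raw/ (ready for later
# parquet transforms), while other data syncs to top-level prefixes.
BUCKET_NAME="example-hub-bucket"
for DIRECTORY in hub-config model-metadata model-output; do
  if [ "$DIRECTORY" = "model-output" ]; then
    dest="$BUCKET_NAME/raw/$DIRECTORY"
  else
    dest="$BUCKET_NAME/$DIRECTORY"
  fi
  echo "would sync ./$DIRECTORY/ -> s3://$dest"
done
```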
Is this the default behavior for all hubs?
It is, with a plan to "fast follow" with the parquet conversion functionality that will put transformed data into `s3://hub-bucket-name/model-output`.

Happy to rename it to something better... `raw` is nomenclature commonly used in corporate data lakes, so that's what popped out of my head 😄

I originally had the idea to copy the original files to both `s3://hub-bucket-name/raw/model-output` and `s3://hub-bucket-name/model-output` until the parquet transformation is in place, but then realized that would just create a bunch of data in the latter location that would require cleanup.
I am OK with the name, I just wonder because some hubs might already be in parquet format, so it will not be necessary to copy them to `raw` and then transform them to parquet. They should be copied directly to `s3://hub-bucket-name/model-output`, no? (For example, the US SMH RSV Hub is already in parquet format.) So I wonder if it makes sense to have this as a default behavior.
Ah, I see--thanks for clarifying! Nick had the same question, and my $.02: we should always have a place like `raw` to store the model-outputs as they were originally submitted: hubverse-org/hubverse-cloud#20 (comment)

On second thought, syncing the model-output folder to both `s3://hub-bucket/model-output` AND `s3://hub-bucket/raw/model-output` would just create cruft to clean up once the parquet conversions are online, so don't do that.
It all looks good to me! I just have a question about the copy to a "raw" folder. Sorry if I missed it, but is this a "default" behavior, and do we already have a plan for the transformation steps that will follow?
@LucieContamin Thanks for the 👀, much appreciated. I'd love to make the cloud development more of a collaboration. Anna is off the project for a few weeks, and Matt is out this week, so if you wanna zoom on any of this, please let me know!
Looks good! Thank you!
Resolves hubverse-org/hubverse-cloud#36

See individual commit messages for additional details:

- Output of an example run that shows syncing the contents of a new `model-auxiliary` folder but ignoring content that has already been synced to S3 (e.g., `model-output`)
- Output of an example run that includes a file rename and a file delete