Bsweger/switch hubverse aws sync to rclone #13
Conversation
Resolves hubverse-org/hubverse-cloud#36. Use rclone for cloud syncing because it has a checksum option for detecting file differences (whereas `aws s3 sync` relies on modification date and file size, which doesn't work well in a CI environment). This changeset also ensures that model-outputs are synced to two places in the target S3 bucket: `model-outputs/` and `raw/model-outputs/`. Once the parquet conversion feature is implemented, `raw/model-outputs/` will reflect the data as it appears in a hub's repo, and `model-outputs/` will hold the "user-facing" parquet files.
```shell
rclone sync \
  "./$DIRECTORY/" \
  ":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY" \
  --checksum --verbose --stats-one-line --config=/dev/null
```
`--checksum` instructs rclone to use file size + hash/checksum when determining whether a file has changed (instead of file size + modification date).

`--config=/dev/null` is a hack to suppress the warning about a missing config file (we don't use one here, since we don't need any special setup options).

More info: https://rclone.org/commands/rclone_sync/#copy-options
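To see why checksum comparison matters in CI, here's a minimal sketch (the file name and temp directory are hypothetical, and `sha256sum` stands in for rclone's internal hashing): a fresh checkout resets every file's modification time, but the content hash is unchanged, so a checksum-based sync has nothing to upload.

```shell
# Why mtime-based comparison over-uploads in CI: a fresh checkout gives
# every file a new modification time, but the content hash is stable.
# (sha256sum stands in for rclone's hashing; paths are hypothetical)
workdir=$(mktemp -d)
echo "team-model forecast data" > "$workdir/model.csv"
sum_before=$(sha256sum "$workdir/model.csv" | cut -d' ' -f1)

touch "$workdir/model.csv"   # simulate CI checkout: new mtime, same bytes

sum_after=$(sha256sum "$workdir/model.csv" | cut -d' ' -f1)
if [ "$sum_before" = "$sum_after" ]; then
  echo "checksums match: nothing to sync"
fi
rm -rf "$workdir"
```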
```shell
then
  rclone sync \
    "./$DIRECTORY/" \
    ":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY" \
```
This line is an rclone S3 connection string. `env_auth` instructs rclone to get AWS credentials via environment variables. Those AWS environment variables are set in the prior *Configure AWS credentials* step.
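A minimal sketch of how the pieces fit together (the credential values and bucket name below are placeholders, not real values): the *Configure AWS credentials* step exports these variables, and the `:s3,provider=AWS,env_auth:` connection string tells rclone to pick them up instead of reading a config file.

```shell
# Placeholders standing in for what the Configure AWS credentials step
# exports; env_auth makes rclone read them from the environment.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"        # hypothetical
export AWS_SECRET_ACCESS_KEY="example-secret"    # hypothetical
export AWS_REGION="us-east-1"                    # hypothetical

BUCKET_NAME="example-hub-bucket"                 # hypothetical
DIRECTORY="model-output"

# Build the same connection string the workflow uses; --dry-run would
# preview the transfer without uploading anything.
remote=":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY"
echo "rclone sync ./$DIRECTORY/ $remote --checksum --dry-run"
```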
```shell
done
# unlike other data, model-outputs are synced to a "raw" location
```
Added an explicit step for model-outputs, so we can land those files in a separate location (because we want to do some transforms on them).
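The layout this produces can be sketched as follows (the directory list and bucket name are hypothetical): most data syncs to its own top-level prefix, while model-output lands under `raw/` so the files as originally submitted are preserved for later transforms.

```shell
# Hypothetical directory list and bucket name; model-output gets its own
# step so the as-submitted files land under raw/ (ready for later
# parquet transforms), while other data syncs to top-level prefixes.
BUCKET_NAME="example-hub-bucket"
for DIRECTORY in hub-config model-metadata model-output; do
  if [ "$DIRECTORY" = "model-output" ]; then
    dest="$BUCKET_NAME/raw/$DIRECTORY"
  else
    dest="$BUCKET_NAME/$DIRECTORY"
  fi
  echo "would sync ./$DIRECTORY/ -> s3://$dest"
done
```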
Is this the default behavior for all hubs?
It is, with a plan to "fast follow" with the parquet conversion functionality that will put transformed data into `s3://hub-bucket-name/model-output`.

Happy to rename it to something better... `raw` is nomenclature commonly used in corporate data lakes, so that's what popped out of my head 😄

I originally had the idea to copy the original files to both `s3://hub-bucket-name/raw/model-output` and `s3://hub-bucket-name/model-output` until the parquet transformation is in place, but then realized that would just create a bunch of data in the latter location that would require cleanup.
I am OK with the name, I just wonder because some hubs might already be in parquet format, so it will not be necessary to copy them to `raw` and then transform them to parquet. They should be copied directly to `s3://hub-bucket-name/model-output`, no? (For example, the US SMH RSV Hub is already in parquet format.) So I wonder if it makes sense to have this as a default behavior.
Ah, I see--thanks for clarifying! Nick had the same question, and my $.02: we should always have a place like `raw` to store the model-outputs as they were originally submitted: hubverse-org/hubverse-cloud#20 (comment)

On second thought, syncing the model-output folder to both `s3://hub-bucket/model-output` AND `s3://hub-bucket/raw/model-output` would just create cruft to clean up once the parquet conversions are online, so don't do that.
It all looks good to me! I just have a question about the copy to a "raw" folder. Sorry if I missed it, but is this a "default" behavior, and do we already have a plan for the transformation steps that will follow?
@LucieContamin Thanks for the 👀, much appreciated. I'd love to make the cloud development more of a collaboration. Anna is off the project for a few weeks, and Matt is out this week, so if you wanna zoom on any of this, please let me know!
Looks good! Thank you!
Resolves hubverse-org/hubverse-cloud#36

See individual commit messages for additional details:

- Output of an example run that shows syncing the contents of a new `model-auxiliary` folder but ignoring content that has already been synced to S3 (e.g., `model-output`)
- Output of an example run that includes a file rename and a file delete