Bsweger/switch hubverse aws sync to rclone #13

Merged
merged 3 commits into from
Mar 18, 2024

Conversation

@bsweger bsweger commented Mar 8, 2024

Resolves hubverse-org/hubverse-cloud#36

See individual commit messages for additional details:

  • bump actions/checkout version
  • swap the utility we use for S3 syncing

Output of an example run that syncs the contents of a new model-auxiliary folder but skips content that has already been synced to S3 (e.g., model-output):

Run hub_directories=(
2024/03/12 19:29:57 INFO  : data.csv: Copied (new)
2024/03/12 19:29:57 INFO  : data.txt: Copied (new)
2024/03/12 19:29:57 INFO  :     2.654 KiB / 2.654 KiB, 100%, 0 B/s, ETA -
2024/03/12 19:29:57 INFO  : There was nothing to transfer
2024/03/12 19:29:57 INFO  :           0 B / 0 B, -, 0 B/s, ETA -
2024/03/12 19:29:58 INFO  : There was nothing to transfer
2024/03/12 19:29:58 INFO  :           0 B / 0 B, -, 0 B/s, ETA -
2024/03/12 19:29:58 INFO  : There was nothing to transfer
2024/03/12 19:29:58 INFO  :           0 B / 0 B, -, 0 B/s, ETA -
2024/03/12 19:29:59 INFO  : There was nothing to transfer
2024/03/12 19:29:59 INFO  :           0 B / 0 B, -, 0 B/s, ETA -

Output of an example run that includes a file rename and a file delete.

2024/03/12 19:38:17 INFO  : cool-data.csv: Copied (new)
2024/03/12 19:38:17 INFO  : data.csv: Deleted
2024/03/12 19:38:17 INFO  : data.txt: Deleted
2024/03/12 19:38:17 INFO  :     2.630 KiB / 2.630 KiB, 100%, 0 B/s, ETA -
2024/03/12 19:38:17 INFO  : There was nothing to transfer
2024/03/12 19:38:17 INFO  :           0 B / 0 B, -, 0 B/s, ETA -
2024/03/12 19:38:17 INFO  : There was nothing to transfer
2024/03/12 19:38:17 INFO  :           0 B / 0 B, -, 0 B/s, ETA -
2024/03/12 19:38:18 INFO  : There was nothing to transfer
2024/03/12 19:38:18 INFO  :           0 B / 0 B, -, 0 B/s, ETA -
2024/03/12 19:38:18 INFO  : There was nothing to transfer
2024/03/12 19:38:18 INFO  :           0 B / 0 B, -, 0 B/s, ETA -

Resolves hubverse-org/hubverse-cloud#36

Use rclone for cloud syncing because it has a checksum option for
detecting file differences (whereas aws s3 sync relies on modification
time and file size, which doesn't work well in a CI environment).

This changeset also ensures that model-outputs are synced to multiple
places in the target S3 bucket: model-outputs/ and raw/model-outputs.
Once the parquet conversion feature is implemented, raw/model-outputs
will reflect the data in a hub's repo, and model-outputs will be the
"user-facing" parquet files.
@bsweger bsweger assigned bsweger and unassigned bsweger Mar 8, 2024
rclone sync \
"./$DIRECTORY/" \
":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY" \
--checksum --verbose --stats-one-line --config=/dev/null

--checksum instructs rclone to compare file size + hash/checksum when determining whether or not a file has changed (instead of file size + modification time)

--config=/dev/null is a hack to suppress output about not finding a config file (we don't use one here, since we're not using any special setup options)

more info:
https://rclone.org/commands/rclone_sync/#copy-options
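
To illustrate why a time-based comparison is unreliable here, the sketch below creates two byte-identical files with different modification times, much like the same file arriving from two separate CI checkouts. The file names are hypothetical, and cksum is only a stand-in for whatever hash rclone actually compares; this is not the workflow's code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for two CI checkouts.
workdir="$(mktemp -d)"

# Two files with identical bytes.
printf 'location,value\nUS,42\n' > "$workdir/run1.csv"
cp "$workdir/run1.csv" "$workdir/run2.csv"

# Backdate the first file so the two copies have different modification times.
touch -t 200001010000 "$workdir/run1.csv"

# A time-based comparison sees a newer file and would transfer it again...
if [ "$workdir/run2.csv" -nt "$workdir/run1.csv" ]; then
  mtime_verdict="changed"
else
  mtime_verdict="unchanged"
fi

# ...while a content hash (cksum as a stand-in) sees identical data.
if [ "$(cksum < "$workdir/run1.csv")" = "$(cksum < "$workdir/run2.csv")" ]; then
  checksum_verdict="unchanged"
else
  checksum_verdict="changed"
fi

echo "modification time says: $mtime_verdict"
echo "checksum says: $checksum_verdict"

rm -rf "$workdir"
```

The two verdicts disagree, which is the situation --checksum is meant to avoid: files whose content has not changed are not uploaded again.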

then
rclone sync \
"./$DIRECTORY/" \
":s3,provider=AWS,env_auth:$BUCKET_NAME/$DIRECTORY" \

This line is an rclone S3 connection string. env_auth instructs rclone to get AWS creds via environment variables. Those AWS env variables are set in the prior Configure AWS credentials step.
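
As a rough sketch of how that connection string is put together, the helper below assembles it from a bucket name and a directory. The s3_remote function and the example bucket and directory values are hypothetical, and the rclone call itself is left as a comment because it needs real credentials.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build an on-the-fly rclone S3 remote in the format used in this PR.
# Arguments: bucket name, directory within the hub repository.
s3_remote() {
  local bucket="$1" directory="$2"
  printf ':s3,provider=AWS,env_auth:%s/%s\n' "$bucket" "$directory"
}

# Hypothetical values, for illustration only.
BUCKET_NAME="example-hub-bucket"
DIRECTORY="model-auxiliary"

remote="$(s3_remote "$BUCKET_NAME" "$DIRECTORY")"
echo "$remote"

# With AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (plus AWS_SESSION_TOKEN for
# temporary credentials) exported in the environment, the sync would be:
#   rclone sync "./$DIRECTORY/" "$remote" \
#     --checksum --verbose --stats-one-line --config=/dev/null
```

Because the remote is defined entirely on the command line, no rclone config file is needed, which is why --config=/dev/null is safe to pass.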

done
# unlike other data, model-outputs are synced to a "raw" location

Added an explicit step for model-outputs, so we can land those files in a separate location (because we want to do some transforms on them).

Is it a default behavior for all the hubs?

It is, with a plan to "fast follow" with the parquet conversion functionality that will put transformed data into s3://hub-bucket-name/model-output

Happy to rename it to something better...raw is a nomenclature commonly used in corporate data lakes, so that's what popped out of my head 😄

I originally had the idea to copy the original files to both s3://hub-bucket-name/raw/model-output and s3://hub-bucket-name/model-output until the parquet transformation is in place, but then realized that would just be creating a bunch of data in the latter location that would require cleanup.
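
A minimal sketch of the resulting layout, assuming a hypothetical s3_prefix_for helper, directory list, and bucket name (the actual workflow uses a separate explicit step for model-output rather than a helper like this):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map a hub directory to its S3 key prefix.
# model-output lands under raw/; every other directory keeps its own name.
s3_prefix_for() {
  local directory="$1"
  if [ "$directory" = "model-output" ]; then
    printf 'raw/%s\n' "$directory"
  else
    printf '%s\n' "$directory"
  fi
}

# Hypothetical directory list and bucket name, for illustration only.
hub_directories=(hub-config model-auxiliary model-output)
BUCKET_NAME="example-hub-bucket"

for directory in "${hub_directories[@]}"; do
  echo "./$directory/ -> s3://$BUCKET_NAME/$(s3_prefix_for "$directory")"
done
```

Only model-output is redirected, so s3://hub-bucket-name/model-output stays empty until the parquet conversion writes to it.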

I am OK with the name. I just wonder because some hubs might already be in parquet format, so it would not be necessary to copy the data to raw and then transform it to parquet. It should be copied directly to s3://hub-bucket-name/model-output, no? (For example, the US SMH RSV Hub is already in parquet format.)
So I wonder if it makes sense to have it as a default behavior.

Ah I see--thanks for clarifying!

Nick had the same question, and my .02: we should always have a place like raw to store the model-outputs as they were originally submitted: hubverse-org/hubverse-cloud#20 (comment)

On second thought, syncing the model-output folder to both
s3://hub-bucket/model-output AND s3://hub-bucket/raw/model-output
will just create cruft to clean up once the parquet conversions
are online, so don't do that.

@LucieContamin LucieContamin left a comment

It all looks good to me! I just have a question about the copy to a "raw" folder. Sorry if I missed it, but is it a "default" behavior, and do we already have a plan for the transformation steps that will follow?

bsweger commented Mar 14, 2024

@LucieContamin Thanks for the 👀 , much appreciated. I'd love to make the cloud development more of a collaboration. Anna is off the project for a few weeks, and Matt is out this week, so if you wanna zoom on any of this, please let me know!

@LucieContamin LucieContamin self-requested a review March 18, 2024 19:01
@LucieContamin LucieContamin left a comment

Looks good! Thank you!

@bsweger bsweger merged commit 327f847 into main Mar 18, 2024
@bsweger bsweger added this to the hubverse cloud sync milestone Apr 5, 2024
Successfully merging this pull request may close these issues.

Switch sync utility used in hubverse-aws-upload workflow
2 participants