gsutil -m rsync -r -d gs://bucket s3://bucket
But we want the buckets to sync automatically, right?
This approach works well for syncing large files, such as Cloud SQL backups, between GCS and S3: Cloud Scheduler calls a Cloud Run container, which runs rclone to sync the buckets. You can see the original article with this idea and code here.
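For context, the sync the container performs boils down to an rclone run along these lines. This is only a sketch: the remote names, config keys, and invocation are assumptions here; the real server.go and rclone config live in the original article.
# rclone.conf (sketch; GCS auth details omitted, see the original article)
[gcs]
type = google cloud storage
[s3]
type = s3
provider = AWS
env_auth = true
# the handler then effectively runs:
rclone sync gcs:$GS s3:$S3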
export PROJECT_ID=`gcloud config get-value core/project`
export PROJECT_NUMBER=`gcloud projects describe $PROJECT_ID --format="value(projectNumber)"`
export REGION=us-central1
export RSYNC_SERVER_SERVICE_ACCOUNT=rsync-sa@$PROJECT_ID.iam.gserviceaccount.com
export RSYNC_SRC=gcs-bucket-name
export RSYNC_DEST=s3-bucket-name
export AWS_ACCESS_KEY_ID=your-aws-key
export AWS_SECRET_ACCESS_ID=your-aws-secret
export AWS_REGION=your-region
gcloud iam service-accounts create rsync-sa --display-name "RSYNC Service Account" --project $PROJECT_ID
export SCHEDULER_SERVER_SERVICE_ACCOUNT=rsync-scheduler@$PROJECT_ID.iam.gserviceaccount.com
gcloud iam service-accounts create rsync-scheduler --display-name "RSYNC Scheduler Account" --project $PROJECT_ID
Configure the uniform bucket access policy, granting the rsync service account read access to the source bucket:
gsutil iam ch serviceAccount:$RSYNC_SERVER_SERVICE_ACCOUNT:objectViewer gs://$RSYNC_SRC
The server.go
The server performs an extra, secondary check on the audience value that Cloud Scheduler sends. This step isn't strictly necessary, since Cloud Run validates the audience value automatically (see Authenticating service-to-service); the check is left in so the service can also run on platforms other than Cloud Run.
To deploy, we first need the URL of the Cloud Run service, but that URL only exists after the first deployment, so we deploy twice: first with a placeholder AUDIENCE value, then again with the real URL.
Build and deploy the Cloud Run service (don't worry about the placeholder AUDIENCE value below):
docker build -t gcr.io/$PROJECT_ID/rsync .
docker push gcr.io/$PROJECT_ID/rsync
(All variables are passed in one comma-separated --set-env-vars flag, since repeating the flag overrides earlier values.)
gcloud beta run deploy rsync --image gcr.io/$PROJECT_ID/rsync \
--set-env-vars "AUDIENCE=https://rsync-random-uc.a.run.app,GS=$RSYNC_SRC,S3=$RSYNC_DEST,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_ID=$AWS_SECRET_ACCESS_ID,AWS_REGION=$AWS_REGION" \
--region $REGION --platform=managed \
--no-allow-unauthenticated \
--service-account $RSYNC_SERVER_SERVICE_ACCOUNT
Get the URL and redeploy
export AUDIENCE=`gcloud beta run services describe rsync --platform=managed --region=$REGION --format="value(status.address.url)"`
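Sanity check: the value should look like https://rsync-<hash>-uc.a.run.app.
echo $AUDIENCE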
gcloud beta run deploy rsync --image gcr.io/$PROJECT_ID/rsync \
--set-env-vars "AUDIENCE=$AUDIENCE,GS=$RSYNC_SRC,S3=$RSYNC_DEST,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_ID=$AWS_SECRET_ACCESS_ID,AWS_REGION=$AWS_REGION" \
--region $REGION --platform=managed \
--no-allow-unauthenticated \
--service-account $RSYNC_SERVER_SERVICE_ACCOUNT
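Optionally, sanity-check the deployment by invoking the service as yourself, using the standard Cloud Run developer-testing flow (your account needs the run.invoker role on the service, and note that this kicks off a real sync):
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" $AUDIENCE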
Configure IAM permissions for the Scheduler to invoke Cloud Run:
gcloud run services add-iam-policy-binding rsync --region $REGION --platform=managed \
--member=serviceAccount:$SCHEDULER_SERVER_SERVICE_ACCOUNT \
--role=roles/run.invoker
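You can confirm the binding landed with:
gcloud run services get-iam-policy rsync --region $REGION --platform=managed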
First, allow the Cloud Scheduler service agent to mint OIDC tokens for the scheduler service account. Render the IAM bindings from the template:
envsubst < "bindings.tmpl" > "bindings.json"
where the template grants the cloudscheduler.serviceAgent role to Google's managed Cloud Scheduler service account:
- bindings.tmpl:
{
  "bindings": [
    {
      "members": [
        "serviceAccount:service-$PROJECT_NUMBER@gcp-sa-cloudscheduler.iam.gserviceaccount.com"
      ],
      "role": "roles/cloudscheduler.serviceAgent"
    }
  ]
}
Assign the IAM policy and schedule the job. The cron expression "0 1 * * *" below runs it daily at 01:00; use "*/5 * * * *" to run every 5 minutes instead:
gcloud iam service-accounts set-iam-policy $SCHEDULER_SERVER_SERVICE_ACCOUNT bindings.json -q
gcloud beta scheduler jobs create http rsync-schedule --schedule "0 1 * * *" \
--http-method=GET \
--uri=$AUDIENCE \
--oidc-service-account-email=$SCHEDULER_SERVER_SERVICE_ACCOUNT \
--oidc-token-audience=$AUDIENCE
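To test the whole chain without waiting for the schedule, force a run (pass --location as well if gcloud asks for it):
gcloud scheduler jobs run rsync-schedule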
The rclone way works fine, but syncing everything on every run is expensive. The method below syncs only new data: a Cloud Function fires each time an object is created or modified in the GCS bucket and copies it to S3. This method is taken entirely "as is" from this repo. Note that the setup below expects your AWS credentials in the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
# Name of your GCP project
PROJECT=my-gcp-project
# Name for the runtime config (this MUST match the bucket name)
CONFIG_NAME=my-source-bucket
# AWS region in which your S3 bucket was created
S3_REGION=us-east-1
# Name of the S3 bucket
S3_TARGET_BUCKET=my-target-bucket
# Name for your Cloud Function
CLOUD_FUNCTION_NAME=syncMyBucket
# GCS Bucket where Cloud Function zip files are stored.
GCS_STAGING_BUCKET=my-cloud-function-bucket
# GCS source bucket to be synced
GCS_SOURCE_BUCKET=my-source-bucket
gcloud --project $PROJECT beta runtime-config configs create $CONFIG_NAME
gcloud --project $PROJECT beta runtime-config configs variables set aws-access-key $AWS_ACCESS_KEY_ID --config-name=$CONFIG_NAME
gcloud --project $PROJECT beta runtime-config configs variables set aws-secret-key $AWS_SECRET_ACCESS_KEY --config-name=$CONFIG_NAME
gcloud --project $PROJECT beta runtime-config configs variables set aws-region $S3_REGION --config-name=$CONFIG_NAME
gcloud --project $PROJECT beta runtime-config configs variables set aws-bucket $S3_TARGET_BUCKET --config-name=$CONFIG_NAME
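Optionally, list the variables to confirm they were stored:
gcloud --project $PROJECT beta runtime-config configs variables list --config-name=$CONFIG_NAME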
gcloud --project $PROJECT beta functions deploy $CLOUD_FUNCTION_NAME --stage-bucket $GCS_STAGING_BUCKET \
--trigger-event providers/cloud.storage/eventTypes/object.change \
--trigger-resource $GCS_SOURCE_BUCKET \
--entry-point syncGCS --runtime nodejs10 \
--set-env-vars GCLOUD_PROJECT=$PROJECT
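Once the function is deployed, a quick end-to-end test is to drop an object into the source bucket and look for it in S3 (this assumes the aws CLI is installed and configured; give the function a few seconds to fire):
echo hello > /tmp/sync-test.txt
gsutil cp /tmp/sync-test.txt gs://$GCS_SOURCE_BUCKET/
aws s3 ls s3://$S3_TARGET_BUCKET/sync-test.txt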